DreamDB
A searchable, versioned, distributed memory protocol for the multimodal information of human civilization.
DreamDB is a file and wire protocol for multimodal data — a storage and retrieval specification that any compliant backend (object store, distributed FS, content-addressed network) can implement. The spec describes how multimodal signals (video, audio, text, vectors) are anchored to a shared timeline, written immutably, addressed by their semantic features, and consumed as native streams.
Four first principles
- Time is the sole primary key. Anchor everything to a high-precision timeline. Abolish human-defined IDs.
- Immutability is the bedrock of collaboration. Append-only, content-addressed. New information is layered on, never overwritten.
- Retrieval is probabilistic localization, not scanning. Vector features encode directly into storage paths. Search is a coordinate calculation, not a traversal.
- Data is stream. Encapsulation is streaming-native. The index is the player's seek pointer.
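As a rough illustration of the first three principles, the sketch below derives a storage path from a timestamp, a feature vector, and the payload bytes. It is not the layout defined in the spec: the function names (`make_hyperplanes`, `lsh_bucket`, `record_path`), the `bucket/timestamp/digest` path shape, the choice of random-hyperplane LSH, and the use of SHA-256 are all assumptions made for this example.

```python
import hashlib
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes in `dim` dimensions (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_bucket(vector, hyperplanes):
    """Map a vector to a hex bucket: one sign bit per hyperplane.

    Vectors pointing in similar directions agree on most bits, so they
    land in the same or a nearby bucket without any scan.
    """
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    width = (len(hyperplanes) + 3) // 4  # hex digits needed for the bit count
    return format(bits, f"0{width}x")


def record_path(timestamp_ns, vector, payload, hyperplanes):
    """Build an illustrative path: semantic bucket / time key / content hash.

    The path shape is an assumption for this sketch, not the DreamDB layout.
    """
    bucket = lsh_bucket(vector, hyperplanes)
    digest = hashlib.sha256(payload).hexdigest()  # content address
    # Zero-padded so that lexicographic order matches chronological order.
    return f"{bucket}/{timestamp_ns:020d}/{digest}"


planes = make_hyperplanes(dim=4, bits=8, seed=1)
path = record_path(1_700_000_000_000_000_000, [0.5, -1.0, 2.0, 0.25], b"frame", planes)
```

In this sketch, a query computes `lsh_bucket` for its own vector and reads that prefix directly, which is the sense in which search becomes a coordinate calculation rather than a traversal. Because the digest depends only on the payload, writing the same bytes twice yields the same final path component.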
What you get
DreamDB is a Rust reference implementation of the protocol (21 spec docs, ~9000 lines) plus a Python SDK that exposes it as a multimodal versioned dataset.
The same protocol surface is available natively in Rust (dreamdb-dataset crate), via CLI (dreamdb binary for maintenance operations), and inside a browser (a JS implementation in dreamdb-dataset-python/python/dreamdb_dataset/web/ for client-side search demos).
Status (2026-05-18)
| Component | Status | Notes |
|---|---|---|
| Protocol spec (specs 0000–0020) | ✅ v0 complete | 21 docs, ~9000 lines. Multi-parent fused-merge formalized in spec/0008 §5.3 |
| Rust SDK (dreamdb-dataset) | ✅ shipping | Dataset/Schema/Sample, ingest, iter_stream, snapshot/branch/merge, delete, tombstones |
| Python bindings | ✅ shipping | PyO3 wheel; PyTorch IterableDataset + Arrow batches |
| CLI (dreamdb binary) | ✅ shipping | ada-ivf-step (k8s 4-stage), gc (k8s-shardable), delete, merge-many, inspect, snapshot |
| HTTP connector (MinIO/S3) | ✅ shipping | dreamdb-connector-http; conformance suite passes |
| Browser SDK + UI (browse.html) | ✅ shipping | Time-travel viewer, semantic search, IVF+RaBitQ on the client |
| Conformance test vectors | ⚠️ partial | CBOR / address / time / spatial covered (dreamdb-conformance/); cross-SDK interop deferred |
| Billion-scale benchmark | ⏳ in progress | imagenet-1k (1.3M records) ingest running 2026-05-18; SIFT1M / LAION-100M next |
| Test suite | ✅ 721 green | Across all crates; 0 failed |
| Production hardening | ⚠️ pre-1.0 | Observability spec, RBAC, language-binding stability deferred to v0.1+ |
Backends
The HTTP connector talks to any S3-shape endpoint. Choose one:
| Backend | Setup | Auth |
|---|---|---|
| MinIO (local dev) | docker run ... minio/minio + mc anonymous set public | None — unsigned requests |
| AWS S3 | Create bucket; set AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION env vars | SigV4 (auto-detected from env) |
| Cloudflare R2 / Backblaze B2 / Wasabi | Create bucket + S3-compat API token; same env vars as AWS | SigV4 (auto-detected) |
| memory:// | Built-in in-process backend | None (testing only) |
The connector detects AWS_* env vars and auto-signs with SigV4 when present. No code changes needed to migrate from local MinIO to production S3 — set the env vars and point at the new backend URL.
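The detection rule described above can be sketched as follows. This is a hypothetical mirror of the behavior, not the connector's code: the function name `detect_auth_mode`, the requirement that all three variables be set, and the choice to raise on a partial configuration are assumptions made for this example.

```python
import os

# The three variables named in the backend table above.
SIGV4_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def detect_auth_mode(env=None):
    """Return "sigv4" when AWS credentials are configured, else "unsigned".

    Assumption for this sketch: a partial configuration is treated as an
    error so that a missing variable is noticed early. The real connector
    may handle that case differently.
    """
    env = os.environ if env is None else env
    present = [name for name in SIGV4_VARS if env.get(name)]
    if len(present) == len(SIGV4_VARS):
        return "sigv4"
    if present:
        missing = [name for name in SIGV4_VARS if name not in present]
        raise ValueError(f"incomplete AWS configuration, missing: {missing}")
    return "unsigned"


# Local MinIO with anonymous access: no AWS_* variables, so requests go unsigned.
mode = detect_auth_mode({})
```

Passing an explicit mapping instead of reading `os.environ` keeps the check easy to test and makes the migration path visible: the same call returns `"sigv4"` once the three variables are set.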
60-second quickstart
Requires Rust ≥1.83, Python ≥3.10, Docker (for MinIO), and uv (or any pip-equivalent).
Run with uv run hello_world.py
A 10-minute hands-on walkthrough — schema + ingest + snapshot + query + PyTorch DataLoader + time-travel + delete + sharded ingest — lives in docs/tutorial.md. For a substantial real-data ingest at scale, see dreamdb-dataset-python/examples/ingest_imagenet100_clip.py.
Repository layout
Spec roadmap
The protocol spec is in spec/. Start with spec/0000-overview.md, then read in order — each doc inherits vocabulary and decisions from the ones before it. INDEX.md is a navigation index with the full Open Questions audit.
What's covered (21 specs, v0 + Phase-3 + Phase-4 drafts): data model, content-addressing, time encoding, spatial indexing (LSH/IVF/IMI), backend interface, eight protocol verbs, streaming encapsulation, versioning + multi-parent merge, conformance, vector compression (PQ/RaBitQ), scalar indexing, federation, graph indexing (Vamana), streaming extensions, hybrid retrieval, streaming freshness, schema evolution, multi-tenant, encryption, tombstones.
Honest gaps (deferred to v0.1+):
- Observability: no spec for metrics / structured logging / health checks. SDKs log at debug/info but the surface isn't standardized.
- Cross-SDK interop: spec/0009 conformance covers protocol-level vectors but not "Rust ingester + Python reader" round-trips.
- Security model: spec/0019 covers encryption at rest; RBAC / capability tokens / audit logging are sketched in spec/0012 §5 but not standalone.
- Tombstone compaction: tombstones suppress on read (spec/0020 §5); the storage-reclamation operator is deferred.
- Chinese translations (spec/chn/): cover specs 0000–0009 only; later specs not translated.
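To make the tombstone gap concrete, here is a minimal sketch of what "suppress on read" means in an append-only store. The `Record` type and `read_visible` function are hypothetical and not taken from spec/0020; they only illustrate why reads can hide a record while the storage it occupies is not yet reclaimed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Set


@dataclass(frozen=True)
class Record:
    """A stored record, identified by its address (hypothetical shape)."""
    address: str
    payload: bytes


def read_visible(log: Iterable[Record], tombstones: Set[str]) -> Iterator[Record]:
    """Yield records whose address has not been tombstoned.

    The log itself is never modified: a delete only adds an address to
    `tombstones`, and readers skip matching records.
    """
    for record in log:
        if record.address not in tombstones:
            yield record


log = [Record("a", b"one"), Record("b", b"two"), Record("c", b"three")]
tombstones = {"b"}
visible = [r.address for r in read_visible(log, tombstones)]
```

After the delete, readers see only `a` and `c`, but `log` still holds all three records. Reclaiming the bytes behind `b` is the compaction step listed above as deferred.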
Reading order
For protocol understanding: spec/0000 → 0001 → 0002 → 0004 → 0007 → 0008 → 0006 (read these in this order, not numerical order).
For implementation reading: dreamdb-core/src/address.rs → dreamdb-protocol/src/manifest.rs → dreamdb-dataset/src/dataset.rs → dreamdb-dataset/src/dataset/append.rs.
For operator workflows: design/0006-10b-scale-blockers.md → design/0007-sharded-ingest.md.
License
Apache-2.0 / MIT dual.