DreamDB

A searchable, versioned, distributed memory protocol for the multimodal information of human civilization.

DreamDB is a file and wire protocol for multimodal data — a storage and retrieval specification that any compliant backend (object store, distributed FS, content-addressed network) can implement. The spec describes how multimodal signals (video, audio, text, vectors) are anchored to a shared timeline, written immutably, addressed by their semantic features, and consumed as native streams.

Four first principles

Time is the sole primary key. Anchor everything to a high-precision timeline. Abolish human-defined IDs.
Immutability is the bedrock of collaboration. Append-only, content-addressed. New information is layered on, never overwritten.
Retrieval is probabilistic localization, not scanning. Vector features encode directly into storage paths. Search is a coordinate calculation, not a traversal.
Data is a stream. Encapsulation is streaming-native. The index is the player's seek pointer.
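To make the first three principles concrete, here is a minimal standard-library sketch, not the DreamDB wire format. It assumes sign-random-projection LSH with a shared seed, a nanosecond time anchor, and SHA-256 content addresses. The names `bucket_path`, `content_address`, `append` and the `buckets/...` key layout are hypothetical; the real encodings are defined in the spec documents.

```python
import hashlib
import random
import time

DIM = 8    # toy embedding dimension (assumption)
BITS = 16  # number of random hyperplanes, one bit each (assumption)

# A fixed seed lets every writer and reader derive the same hyperplanes.
_rng = random.Random(42)
HYPERPLANES = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BITS)]


def bucket_path(vec):
    """Map a vector to a storage prefix via sign random projections (LSH)."""
    bits = 0
    for plane in HYPERPLANES:
        dot = sum(p * x for p, x in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    code = f"{bits:04x}"  # 16 bits -> 4 hex digits
    return f"buckets/{code[:2]}/{code[2:]}"


def content_address(payload: bytes) -> str:
    """Content address: the SHA-256 digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def append(store, vec, payload, anchor_ns=None):
    """Write once under a key built from bucket, time anchor, and digest."""
    anchor = time.time_ns() if anchor_ns is None else anchor_ns
    key = f"{bucket_path(vec)}/{anchor}-{content_address(payload)[:12]}"
    if key in store:
        raise KeyError(f"immutable object already exists: {key}")
    store[key] = payload
    return key
```

A query vector is hashed with the same hyperplanes, so the candidate prefix is computed rather than discovered by listing the store. Rescaling a vector does not change its bucket, which is the property a cosine-style index needs.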

What you get

This repository ships a Rust reference implementation of the protocol (the spec itself spans 21 docs, ~9000 lines) plus a Python SDK that exposes it as a multimodal, versioned dataset:

python

import dreamdb_dataset as vd

# Define a schema with image + CLIP embedding + label (chainable)
schema = (vd.Schema()
          .add_image("image", mime="jpeg")
          .add_embedding("embedding", dim=512, algorithm="dreamdb.ivf-cosine", rerank=True)
          .add_scalar_categorical("label"))

# (Or imperative if you prefer: schema = vd.Schema(); schema.add_image(...); ...)

# Create a Dataset on MinIO/S3
backend = "http://minio:9000/imagenet-1k"
ds = vd.Dataset.create("imagenet-1k", schema, backend=backend)

# Append a batch of samples (one dict per row, keyed by field name)
# (jpeg_bytes, clip_vec, and clip_embed below are placeholders for your own data and encoder)
ds.append_many([
    {"image": jpeg_bytes, "embedding": clip_vec, "label": "tabby cat"},
    # … one dict per sample
])

# Snapshot the dataset at a point in time (for reproducible training)
v = ds.snapshot("baseline-2026-05-18")

# Query: top-K nearest neighbors via spatial indexing
hits = ds.iter_vector(field="embedding", query=clip_embed("a brown dog"), top_k=10)

# Sharded parallel ingest across N workers, then merge back
# (each worker calls ds.branch("worker-N") + append_many on its slice)
ds.merge_many(["worker-0", "worker-1", "worker-2"])

# GDPR-style deletion (read-side suppression; storage compaction is a separate op)
ds.delete(anchors=[1779083474791115000], reason="gdpr")

# Time-travel: open any prior snapshot
old = vd.Dataset.open_at(v, backend=backend)

The same protocol surface is available natively in Rust (dreamdb-dataset crate), via CLI (dreamdb binary for maintenance operations), and inside a browser (a JS implementation in dreamdb-dataset-python/python/dreamdb_dataset/web/ for client-side search demos).
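The snapshot, branch, merge, and time-travel calls above can be pictured as operations on a chain of immutable, content-addressed manifests. The sketch below is a toy model under that assumption, not DreamDB's Manifest format (spec/0008 is authoritative). `ToyRepo` and all of its methods are hypothetical names.

```python
import hashlib
import json


class ToyRepo:
    """Toy model: immutable manifests plus mutable named refs."""

    def __init__(self):
        self.objects = {}  # content address -> immutable manifest bytes
        self.refs = {}     # ref name (branch or snapshot) -> address

    def _put(self, manifest):
        blob = json.dumps(manifest, sort_keys=True).encode()
        addr = hashlib.sha256(blob).hexdigest()
        self.objects.setdefault(addr, blob)  # never overwritten
        return addr

    def commit(self, ref, anchors, parents=None):
        """Layer new anchors on top of the ref's current manifest."""
        if parents is None:
            parents = [self.refs[ref]] if ref in self.refs else []
        addr = self._put({"parents": parents, "anchors": sorted(anchors)})
        self.refs[ref] = addr
        return addr

    def snapshot(self, ref, name):
        """A snapshot is just a second name for the current address."""
        self.refs[name] = self.refs[ref]
        return self.refs[name]

    branch = snapshot  # a branch starts as the same kind of pointer

    def merge_many(self, ref, branches):
        """Multi-parent merge: one manifest whose parents are all heads."""
        parents = [self.refs[ref]] + [self.refs[b] for b in branches]
        return self.commit(ref, [], parents=parents)

    def read(self, addr):
        """Union of all anchors reachable from addr through parent links."""
        seen, out, stack = set(), set(), [addr]
        while stack:
            a = stack.pop()
            if a in seen:
                continue
            seen.add(a)
            manifest = json.loads(self.objects[a])
            out.update(manifest["anchors"])
            stack.extend(manifest["parents"])
        return sorted(out)


repo = ToyRepo()
repo.commit("main", [1, 2])
baseline = repo.snapshot("main", "baseline")
for i, anchor in enumerate([3, 4]):
    repo.branch("main", f"worker-{i}")
    repo.commit(f"worker-{i}", [anchor])
repo.merge_many("main", ["worker-0", "worker-1"])
```

After the merge, reading `main` sees the anchors from both workers, while reading the `baseline` address still returns only what existed when the snapshot was taken. Nothing was rewritten to achieve either view.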

Status (2026-05-18)

Component	Status	Notes
Protocol spec (specs 0000–0020)	✅ v0 complete	21 docs, ~9000 lines. Multi-parent fused-merge formalized in `spec/0008 §5.3`
Rust SDK (`dreamdb-dataset`)	✅ shipping	Dataset/Schema/Sample, ingest, iter_stream, snapshot/branch/merge, delete, tombstones
Python bindings	✅ shipping	PyO3 wheel; PyTorch IterableDataset + Arrow batches
CLI (`dreamdb` binary)	✅ shipping	ada-ivf-step (k8s 4-stage), gc (k8s-shardable), delete, merge-many, inspect, snapshot
HTTP connector (MinIO/S3)	✅ shipping	`dreamdb-connector-http`; conformance suite passes
Browser SDK + UI (`browse.html`)	✅ shipping	Time-travel viewer, semantic search, IVF+RaBitQ on the client
Conformance test vectors	⚠️ partial	CBOR / address / time / spatial covered (`dreamdb-conformance/`); cross-SDK interop deferred
Billion-scale benchmark	⏳ in progress	imagenet-1k (1.3M records) ingest running 2026-05-18; SIFT1M / LAION-100M next
Test suite	✅ 721 green	Across all crates; 0 failed
Production hardening	⚠️ pre-1.0	Observability spec, RBAC, language-binding stability deferred to v0.1+

Backends

The HTTP connector talks to any S3-compatible endpoint. Choose one:

Backend	Setup	Auth
MinIO (local dev)	`docker run ... minio/minio` + `mc anonymous set public`	None — unsigned requests
AWS S3	Create bucket; set `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `AWS_REGION` env vars	SigV4 (auto-detected from env)
Cloudflare R2 / Backblaze B2 / Wasabi	Create bucket + S3-compat API token; same env vars as AWS	SigV4 (auto-detected)
`memory://`	Built-in in-process backend	None (testing only)

The connector detects AWS_* env vars and auto-signs with SigV4 when present. No code changes needed to migrate from local MinIO to production S3 — set the env vars and point at the new backend URL.
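The selection logic described above might look roughly like the following. This is an illustrative sketch only: `pick_auth` is a hypothetical function, and the exact set of variables the real connector inspects is an assumption based on the table.

```python
import os


def pick_auth(backend_url, env=None):
    """Return 'none', 'unsigned', or 'sigv4' for a backend URL.

    Assumption: both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be
    set to trigger signing; the real connector may check differently.
    """
    env = os.environ if env is None else env
    if backend_url.startswith("memory://"):
        return "none"  # in-process backend, nothing to sign
    has_keys = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(
        env.get("AWS_SECRET_ACCESS_KEY")
    )
    return "sigv4" if has_keys else "unsigned"
```

Under this model, moving from local MinIO to a production bucket changes only the URL and the environment, which matches the no-code-changes claim above.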

60-second quickstart

Requires Rust ≥1.83, Python ≥3.10, Docker (for MinIO), and uv (or any pip-equivalent).

bash

# 1. Clone + build
git clone https://github.com/<your-org>/dreamdb
cd dreamdb
cargo build --release            # ~5 min cold; <30s incremental

# 2. Start a MinIO backend
docker run -d --name dreamdb-minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=dreamdb \
  -e MINIO_ROOT_PASSWORD=dreamdbsecret \
  minio/minio:latest server /data --console-address ":9001"
docker exec dreamdb-minio mc alias set local http://localhost:9000 dreamdb dreamdbsecret
docker exec dreamdb-minio mc mb local/demo
docker exec dreamdb-minio mc anonymous set public local/demo

# 3. Build + install the Python wheel
cd dreamdb-dataset-python
uv pip install maturin
maturin develop --release

# 4. "Hello world": write 1000 random vectors, then query the nearest 5

Save the script below as hello_world.py and run it with uv run hello_world.py.

python

#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///

import numpy as np, dreamdb_dataset as vd

s = vd.Schema().add_embedding("v", dim=32, algorithm="dreamdb.lsh-cosine")
ds = vd.Dataset.create("hello", s, backend="http://localhost:9000/demo")

samples = [{"v": np.random.randn(32).astype(np.float32)} for _ in range(1000)]
ds.append_many(samples)

q = np.random.randn(32).astype(np.float32).tolist()
batches = ds.iter_vector(field="v", query=q, top_k=5)
for batch in batches:
    for anchor in batch["_time_anchors"]:
        print(f"hit: anchor={anchor}")
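The anchors printed by the script are large integers. The example anchor used earlier in this README (1779083474791115000) is consistent with nanoseconds since the Unix epoch, so a helper like the one below can make them readable. That interpretation is an assumption; the time-encoding spec is authoritative, and `anchor_to_datetime` is a hypothetical helper, not part of the SDK.

```python
from datetime import datetime, timedelta, timezone


def anchor_to_datetime(anchor_ns: int) -> datetime:
    """Interpret an anchor as nanoseconds since the Unix epoch (assumption).

    Integer arithmetic avoids float rounding; sub-microsecond digits are
    dropped because datetime only stores microseconds.
    """
    secs, nanos = divmod(anchor_ns, 1_000_000_000)
    whole = datetime.fromtimestamp(secs, tz=timezone.utc)
    return whole + timedelta(microseconds=nanos // 1000)
```

Under this reading, the example anchor falls on 2026-05-18 (UTC), the same date as the status table below.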

bash

# 5. Inspect the Manifest history
./target/release/dreamdb inspect \
  --backend http://localhost:9000/demo --ref-name hello --max-depth 10

A 10-minute hands-on walkthrough — schema + ingest + snapshot + query + PyTorch DataLoader + time-travel + delete + sharded ingest — lives in docs/tutorial.md. For a substantial real-data ingest at scale, see dreamdb-dataset-python/examples/ingest_imagenet100_clip.py.

Repository layout

dreamdb/
├── spec/                   # 21 protocol spec docs (start with spec/0000)
├── INDEX.md                # spec navigation + Open Questions audit
├── design/                 # implementation design docs (operator-facing)
├── dreamdb-core/            # addressing, hashing, CBOR primitives — shared by every layer
├── dreamdb-protocol/        # Object types (Manifest, Track, Bucket, ...) + verbs (Open/Append/Query/Stream)
├── dreamdb-connector/       # storage trait + memory backend
├── dreamdb-connector-http/  # MinIO/S3 implementation
├── dreamdb-dataset/         # high-level SDK (Schema, Sample, Dataset, Filter, Batch, MergeStrategy)
├── dreamdb-dataset-python/  # PyO3 bindings + Python wrapper + browser JS demo
├── dreamdb-cli/             # operator binaries (rebuild-ivf, ada-ivf-step, gc, delete, merge-many)
├── dreamdb-conformance/     # JSON test vectors per spec/0009 §3
└── dreamdb-bench/           # microbenchmarks (storage / latency / recall)

Spec roadmap

The protocol spec is in spec/. Start with spec/0000-overview.md, then read in order — each doc inherits vocabulary and decisions from the ones before it. INDEX.md is a navigation index with the full Open Questions audit.

What's covered (21 specs, v0 + Phase-3 + Phase-4 drafts): data model, content-addressing, time encoding, spatial indexing (LSH/IVF/IMI), backend interface, eight protocol verbs, streaming encapsulation, versioning + multi-parent merge, conformance, vector compression (PQ/RaBitQ), scalar indexing, federation, graph indexing (Vamana), streaming extensions, hybrid retrieval, streaming freshness, schema evolution, multi-tenant, encryption, tombstones.

Honest gaps (deferred to v0.1+):

Observability: no spec for metrics / structured logging / health checks. SDKs log at debug/info but the surface isn't standardized.
Cross-SDK interop: spec/0009 conformance covers protocol-level vectors but not "Rust ingester + Python reader" round-trips.
Security model: spec/0019 covers encryption at rest; RBAC / capability tokens / audit logging are sketched in spec/0012 §5 but not standalone.
Tombstone compaction: tombstones suppress on read (spec/0020 §5); the storage-reclamation operator is deferred.
Chinese translations (spec/chn/): cover specs 0000–0009 only; later specs not translated.
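The tombstone gap listed above is easier to see with a toy model of read-side suppression. The sketch assumes tombstones are append-only records keyed by time anchor. The record layout and the `visible` helper are hypothetical and do not reflect the format in spec/0020.

```python
def visible(records, tombstones):
    """Return the records whose anchor has not been tombstoned."""
    dead = {t["anchor"] for t in tombstones}
    return [r for r in records if r["anchor"] not in dead]


# Stored records are never modified in place.
records = [
    {"anchor": 100, "label": "tabby cat"},
    {"anchor": 200, "label": "brown dog"},
    {"anchor": 300, "label": "goldfish"},
]

# A delete only appends a tombstone; readers filter on it.
tombstones = []
tombstones.append({"anchor": 200, "reason": "gdpr"})

readable = visible(records, tombstones)
```

Here `readable` no longer contains anchor 200, yet `records` still holds all three entries. Reclaiming those bytes is the separate compaction step that the list above marks as deferred.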

Reading order

For protocol understanding: spec/0000 → 0001 → 0002 → 0004 → 0007 → 0008 → 0006 (read these in this order, not numerical order).

For implementation reading: dreamdb-core/src/address.rs → dreamdb-protocol/src/manifest.rs → dreamdb-dataset/src/dataset.rs → dreamdb-dataset/src/dataset/append.rs.

For operator workflows: design/0006-10b-scale-blockers.md → design/0007-sharded-ingest.md.

License

Dual-licensed under Apache-2.0 and MIT.