DreamDB TODO Roadmap
Snapshot 2026-05-15 (original), refreshed 2026-05-18 after the 10B-scale push (B1–B8 all ✅). Companion to 0002-known-flaws-retrospective.md (what's wrong) and 0003-scope-boundaries.md (where things belong).
Status as of 2026-05-18: every P0 SDK + CLI item listed below has shipped. Every P1 item that was on the 10B-blocker critical path (B1 streaming iter, B2 sharded ingest / union-merge, B3 sharded redispatch, B4 prefix-sharded GC, B6 rayon hash_vector, B8 tombstones) has shipped. The roadmap below is preserved as a historical snapshot; the authoritative current TODO is design/0006-10b-scale-blockers.md ("Quality-of-life follow-ups" section) plus the README's "Honest gaps" list.
This was the consolidated TODO across all four layers (Protocol / SDK / Operator / App) plus cross-cutting work. Priorities:
- P0 — blocks the canonical ML workflow (pull dataset → sweep → branch → compare). All ✅ shipped by 2026-05-18.
- P1 — significant production value; needed at 100M+ scale or for regulated deployments. Most ✅ shipped via 10B push.
- P2 — nice-to-have, future iteration.
Effort is a rough LOC estimate for code, or a rough number of weeks for cross-cutting work. "Spec change" means the item needs a numbered amendment to spec/*.md.
P0: Make the sweep workflow first-class
The user-visible goal: user pulls dataset → sweep processing/training → adds tracks on a branch → compares results. This list is the minimum delta from where we are today.
SDK (Layer 2)
| Item | What | Effort | Depends on |
|---|---|---|---|
| Dataset::add_embedding_layer | Mirror of existing add_scalar_layer. Lets a branch add a new embedding modality (e.g. embedding_bert_v2) without re-ingesting source data. Publishes new SI + new Track + new Manifest. | ~80 LOC | Phase 3.1 (chain-aware lineage), done |
| Dataset::snapshot(label) -> DatasetVersion | Real impl of the existing stub. Creates a named Ref at current tip via If-None-Match: * PUT. Returns a DatasetVersion capturing the manifest hash + label. | ~50 LOC | Dataset::branch, done |
| Dataset::open_at(ref_or_snapshot) | Sibling of open() that takes a snapshot label OR a manifest hash; pins reads to that exact state. | ~30 LOC | snapshot |
| Dataset::iter_arrow_batches(batch_size, fields, shuffle_seed) | Stream RecordBatches. Walks selected modality Tracks, joins by record ordinal, emits typed Arrow columns. Embeddings as FixedSizeList<f32>; images as Binary; scalars typed. | ~300 LOC + 1 new dep (arrow-array) | none |
| dreamdb_dataset.torch.DreamDBDataset | IterableDataset subclass wrapping iter_arrow_batches. Supports num_workers > 0 via worker_init_fn partitioning by record-ordinal modulo. | ~150 LOC Python | iter_arrow_batches |
| Dataset::compare_refs(refs, fields) -> Arrow table | Wide-form table: `record_id | field@ref_a | field@ref_b |
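
The `num_workers > 0` support described for DreamDBDataset depends on each worker reading only the record ordinals congruent to its own id modulo the worker count. A minimal pure-Python sketch of that partitioning follows. The helper name `partition_ordinals` is hypothetical and not part of the SDK, and the sketch deliberately avoids any PyTorch dependency.

```python
from typing import Iterable, Iterator


def partition_ordinals(
    ordinals: Iterable[int], worker_id: int, num_workers: int
) -> Iterator[int]:
    """Yield only the record ordinals owned by `worker_id` (hypothetical helper)."""
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    for ordinal in ordinals:
        # Modulo partitioning: every ordinal belongs to exactly one worker.
        if ordinal % num_workers == worker_id:
            yield ordinal


# Three workers splitting ten records between them.
shards = [list(partition_ordinals(range(10), w, 3)) for w in range(3)]
```

Because the shards are disjoint and together cover every ordinal, no record is seen twice or dropped within an epoch, whatever the worker count.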
CLI (Layer 3)
| Item | What | Effort |
|---|---|---|
| dreamdb-cli compare-refs --field <name> <ref-a> <ref-b> ... | Same as the SDK verb but command-line. Prints summary stats (mean/median diff for scalars; mean cosine sim for embeddings) and optionally writes the full Arrow table to a file. | ~100 LOC |
| dreamdb-cli snapshot --ref <src> --label <label> | One-shot wrapper for Dataset::snapshot. | ~40 LOC |
| dreamdb-cli inspect --ref <name> | Walk the Manifest DAG; print snapshot history with stats (record count delta, track delta, writer, ts) per snapshot. Useful for "what changed in the last 10 commits". | ~150 LOC |
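
The summary stats named for compare-refs (mean/median diff for scalars, mean cosine similarity for embeddings) can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, values are assumed to be already aligned by record, and the diff direction is taken as `b - a`.

```python
import math
import statistics
from typing import Dict, Sequence


def scalar_diff_summary(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Mean and median of the per-record difference b - a (assumed direction)."""
    if len(a) != len(b):
        raise ValueError("both refs must cover the same records")
    diffs = [y - x for x, y in zip(a, b)]
    return {
        "mean_diff": statistics.fmean(diffs),
        "median_diff": statistics.median(diffs),
    }


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_cosine_similarity(emb_a, emb_b) -> float:
    """Average cosine similarity over record-aligned embedding pairs."""
    return statistics.fmean(cosine(u, v) for u, v in zip(emb_a, emb_b))


scalar_stats = scalar_diff_summary([1.0, 2.0, 4.0], [1.5, 2.0, 5.0])
embedding_stat = mean_cosine_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
```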
Examples / App (Layer 4)
| Item | What | Effort |
|---|---|---|
| examples/sweep_runner.py | End-to-end: open source, branch N times, run a fake training loop per branch, append results, exit. Use as a smoke test for the sweep workflow. | ~250 LOC |
| examples/sweep_dashboard.html | Browser app: list all sweep/* refs, side-by-side comparison of scalar metrics, click-to-view individual records. | ~500 LOC (HTML+JS) |
| examples/training_recipe.py | PyTorch fine-tuning loop using dreamdb_dataset.torch.DreamDBDataset. Snapshot before training, label snapshot after with model commit hash. | ~150 LOC |
| examples/active_learning_loop.py | Demo: model.predict_uncertain → human labels → append → retrain on next snapshot. | ~200 LOC |
P0 total: ~2000 LOC across SDK/CLI/examples. Probably 2-3 weeks of focused effort.
P4 update (2026-05-15 evening): first real training run hit P1 priorities
A 100-class linear probe on imagenet-100's CLIP embeddings ran end-to-end
(examples/linear_probe.py). Final result: val_acc 4.9% → 36.0% over
5 epochs, training set 226,689 records × 512 dim. Two real gaps were
surfaced and fixed during the run:
| Gap | Status | Fix |
|---|---|---|
| iter_arrow_batches didn't include embedding columns | ✅ Fixed (P4.0) | Extended iter_time_range to walk SpatialBucket tracks + decode via vc.decode |
| Paged TrackObjects rejected in iter path | ✅ Fixed (P4.0) | B-tree walk in fetch_spatial_bucket_track_entries |
| Eager fetch of 42K buckets serially → 486s load time | 🟡 Partially fixed | Added buffer_unordered(16) parallel fetch (measuring speedup in v4 run) |
| Streaming iter (returns Stream instead of Vec) | ⏳ Open | Real fix for the 1B-scale case; ~400 LOC |
Implication for P1 ordering: streaming iter is now the highest-leverage P1 item, and the other P1 items are demoted below it. This ordering is evidence-driven, not speculative.
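
To make the two fetch-related gaps concrete, here is a Python asyncio analogue of a bounded, unordered parallel fetch that also yields results as a stream instead of collecting them first. It is a sketch only: the real code is Rust and uses buffer_unordered, `fetch_unordered` and `fake_fetch` are hypothetical names, and unlike a lazy stream this version creates every task up front and relies on a semaphore to cap the number in flight.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable, TypeVar

K = TypeVar("K")
V = TypeVar("V")


async def fetch_unordered(
    keys: Iterable[K], fetch: Callable[[K], Awaitable[V]], limit: int = 16
) -> AsyncIterator[V]:
    """Yield fetch(key) results in completion order, at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(key: K) -> V:
        async with semaphore:  # caps concurrent fetches
            return await fetch(key)

    tasks = [asyncio.ensure_future(guarded(key)) for key in keys]
    for finished in asyncio.as_completed(tasks):
        # The consumer sees each result as soon as it is ready.
        yield await finished


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(key: int) -> int:
        # Stand-in for a bucket fetch; tracks the peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)
        in_flight -= 1
        return key * 2

    values = [v async for v in fetch_unordered(range(100), fake_fetch, limit=16)]
    return values, peak
```

Running `asyncio.run(demo())` returns all 100 results in completion order, with the observed peak concurrency never exceeding the limit.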
P1.0 new (from v4 probe 2026-05-15): fields filter doesn't propagate
into Rust. iter_arrow_batches(fields=["embedding", "label"]) filters the
Arrow columns AFTER fetching, but Rust's iter_time_range still fetches
ALL blob fields including images. For a probe that doesn't need images,
this means hundreds of MB of unnecessary fetch. Fix: thread a fields: Option<HashSet<String>> parameter from Python through Filter and gate
the blob_fields / scalar_fields / embedding_fields loops on it. ~50 LOC,
high-leverage.
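
The fix amounts to applying the filter before the fetch, not after it. The sketch below models that in Python with an in-memory stand-in for the store. `FakeStore` and `iter_rows` are hypothetical names, and `fields=None` is assumed to mean "no filter", mirroring the proposed `Option<HashSet<String>>`.

```python
from typing import Dict, Iterator, List, Optional, Set


class FakeStore:
    """In-memory stand-in for the object store; logs which fields were fetched."""

    def __init__(self, columns: Dict[str, List[object]]):
        self.columns = columns
        self.fetched: List[str] = []

    def fetch_column(self, name: str) -> List[object]:
        self.fetched.append(name)
        return self.columns[name]


def iter_rows(
    store: FakeStore, all_fields: List[str], fields: Optional[Set[str]] = None
) -> Iterator[Dict[str, object]]:
    # None means "no filter": fetch every field.
    wanted = [f for f in all_fields if fields is None or f in fields]
    # The gate is applied BEFORE fetching, so unwanted blobs are never read.
    columns = {f: store.fetch_column(f) for f in wanted}
    row_count = min((len(c) for c in columns.values()), default=0)
    for i in range(row_count):
        yield {f: columns[f][i] for f in wanted}


store = FakeStore({"embedding": [[0.1, 0.2]], "label": [7], "image": [b"\x89PNG"]})
rows = list(iter_rows(store, ["embedding", "label", "image"], {"embedding", "label"}))
```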
P1: Production-grade at scale
The work that's needed before DreamDB can replace a managed vector DB in a real deployment.
Spec (Layer 1)
| Item | What | Effort | Why P1 |
|---|---|---|---|
| Tombstones (spec/0020?) | Define dreamdb.tombstones registry entry shape. Per-modality list of (track_position, anchor_hash) pairs. Query path skips tombstoned records. GC eventually compacts. | Spec amendment + ~400 LOC | GDPR; correction of bad records; mandatory for production. |
| Schema evolution (spec/0017 exists as draft; implement) | Define what a schema migration CAN change without re-ingest: adding optional fields ✅, dropping fields ✅, changing existing field types ❌. Manifest carries schema_version: u32. | Spec + ~200 LOC | Multi-version dataset coexistence. |
| Phase 3.4b: Incremental Track B-tree update | Currently publish_spatial_bucket_track rebuilds the whole B-tree. With chain-aware lineage, only leaves containing CHANGED entries need new content; only their ancestor pages need re-PUTting. | Spec amendment to spec/0002 §7.3.2 clarifying chain-aware page reuse, plus ~300 LOC | At 1B-cell scale the full B-tree rewrite is the new dominant cost. |
| Multi-parent merge semantics (spec/0008 extension) | Define what it means when a Manifest has parents = [A, B] and A's embedding modality has SI X, B's has SI Y. Conflict detection algorithm. | Spec amendment | Foundation for MergeStrategy::RefuseOnSiConflict. |
| Streaming freshness (spec/0016 exists as draft; implement) | Records visible BEFORE a Manifest is published (committed-but-not-yet-published). Critical for low-latency append + query workflows. | Substantial spec + impl | Real-time use cases. |
| Address-scheme amendment | Move spatial_key off the bucket path; encode in Track entry only. Eliminates the re-PUT in cold-bucket spatial_key shift case from Phase 3.2. | Spec amendment (backwards-compat: dual-form during transition) | Saves ~1 HTTP PUT per cold-bucket-with-shift. Significant at large scale. |
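
As an illustration of the tombstone item above, the read path can be pictured as a filter over Track entries keyed by (track_position, anchor_hash). This is a sketch under assumptions: the registry entry shape is not yet specified, so the per-modality dict of pairs, the entry tuple layout, and the `live_entries` name are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Tombstone = Tuple[int, str]     # (track_position, anchor_hash)
Entry = Tuple[int, str, bytes]  # (track_position, anchor_hash, payload)


def live_entries(
    modality: str,
    entries: Iterable[Entry],
    tombstones: Dict[str, Set[Tombstone]],
) -> Iterator[Entry]:
    """Yield entries of `modality` that are not tombstoned (hypothetical helper)."""
    dead = tombstones.get(modality, set())
    for position, anchor_hash, payload in entries:
        # Both the position and the hash must match, so a stale tombstone
        # cannot hide a different record that later occupies the same slot.
        if (position, anchor_hash) in dead:
            continue
        yield position, anchor_hash, payload


tombstones = {"embedding": {(1, "h1")}}
entries = [(0, "h0", b"a"), (1, "h1", b"b"), (2, "h2", b"c")]
kept = list(live_entries("embedding", entries, tombstones))
```

The underlying objects stay in place until GC compacts them; only the query path changes.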
SDK (Layer 2)
| Item | What | Effort |
|---|---|---|
| MergeStrategy::RefuseOnSiConflict impl | Multi-parent Manifest construction. Uses chain-aware lineage (Phase 3.1) to find common ancestor SI. Conflict detection per spec/0008. | ~250 LOC |
| Dataset::delete(ordinals: Vec<u64>) | Publish a Manifest with dreamdb.tombstones entry covering those ordinals. Subsequent reads skip them. GC eventually compacts. | ~150 LOC + tombstones spec |
| Dataset::update_schema(...) | Schema migration verb. Validates the diff is forward-compatible; publishes a new Manifest with the updated schema in the registry; existing data untouched. | ~150 LOC + schema-evolution spec |
| Phase 3.3: Sharded redispatch | True multi-pod redispatch. Workers handle their slice's replaced cells; emit partial bucket-entry lists; orchestrator stitches into a paged Track. Four-phase k8s Job pipeline. | ~400 LOC + k8s YAML |
| Paged-track read in Dataset::iter | Currently iter likely doesn't walk paged tracks (need to verify). Add flatten_paged_* helpers similar to ada-ivf-step's. | ~80 LOC |
| Local working-copy cache | Optional on-disk cache for SDK reads. Lets users "clone" a snapshot for offline access. Useful for laptop-based training on cloud-backed datasets. | ~400 LOC |
| HNSW algorithm | Alongside IVF and LSH. Better recall at moderate scale. Vamana algorithm + serialization already in protocol; needs query path + index-build. | ~600 LOC |
| IVF-PQ algorithm | Better compression-recall trade-off than RaBitQ. Faiss's classic combination. | ~400 LOC |
| Parallel hash_vector via rayon | The IVF dispatch loop runs serially per query. Parallelize for queries that fan out to many cells. | ~30 LOC |
| Dataset::time_anchor_iter(since, until) | Iterator that yields records whose time_anchor falls in a window. Useful for "give me data from the last 24h" without scanning the whole dataset. | ~80 LOC |
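
The forward-compatibility check behind Dataset::update_schema can be sketched from the rules stated in the schema-evolution item: adding optional fields and dropping fields are allowed, changing an existing field's type is not. The `Field` type and `schema_diff_errors` name are hypothetical, and rejecting a newly added required field is an assumption inferred from "adding optional fields".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    dtype: str
    optional: bool = False


def schema_diff_errors(old: Dict[str, Field], new: Dict[str, Field]) -> List[str]:
    """Return reasons the migration is not forward-compatible (empty list = OK)."""
    errors: List[str] = []
    for name, field in new.items():
        if name not in old:
            # Assumption: only OPTIONAL additions are safe without re-ingest.
            if not field.optional:
                errors.append(f"{name}: new fields must be optional")
        elif old[name].dtype != field.dtype:
            errors.append(f"{name}: type change {old[name].dtype} -> {field.dtype}")
    # Dropped fields are allowed, so fields missing from `new` are not checked.
    return errors


old = {"label": Field("i64"), "caption": Field("utf8")}
ok = schema_diff_errors(old, {"label": Field("i64"), "score": Field("f32", optional=True)})
bad = schema_diff_errors(old, {"label": Field("utf8"), "caption": Field("utf8")})
```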
CLI (Layer 3)
| Item | What | Effort |
|---|---|---|
| dreamdb-cli rebuild-ivf paged-track support | Currently rebuild-ivf writes inline; needs publish_spatial_bucket_track integration. | ~30 LOC |
| dreamdb-cli diff <ref-a> <ref-b> | Set-diff of records: present in B but not A, and vice versa. | ~120 LOC |
| dreamdb-cli sweep-init --source <ref> --runs <n> | Helper: create N branches from a source ref with naming convention sweep/run-NNN. | ~80 LOC |
| dreamdb-cli sweep-summarize --pattern 'sweep/*' --metric loss | Tabulate scalar metrics across matching refs; pick best/worst. | ~150 LOC |
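
The two sweep helpers reduce to a naming convention plus a glob over ref names. The sketch below shows both under stated assumptions: run numbering starts at 0 with three-digit zero padding (the source only says sweep/run-NNN), lower metric values are treated as better because the example metric is loss, and the function names are hypothetical.

```python
import fnmatch
from typing import Dict, List, Tuple


def sweep_ref_names(runs: int, prefix: str = "sweep/run-") -> List[str]:
    """Branch names for a sweep, e.g. sweep/run-000 (numbering from 0 is assumed)."""
    return [f"{prefix}{i:03d}" for i in range(runs)]


def summarize(
    metrics: Dict[str, float], pattern: str, lower_is_better: bool = True
) -> Tuple[Tuple[str, float], Tuple[str, float]]:
    """Return (best, worst) as (ref, value) pairs among refs matching `pattern`."""
    matched = {
        ref: value
        for ref, value in metrics.items()
        if fnmatch.fnmatchcase(ref, pattern)
    }
    if not matched:
        raise ValueError(f"no refs match {pattern!r}")
    ordered = sorted(matched.items(), key=lambda kv: kv[1], reverse=not lower_is_better)
    return ordered[0], ordered[-1]


names = sweep_ref_names(3)
losses = {"sweep/run-000": 0.9, "sweep/run-001": 0.4, "sweep/run-002": 0.7, "main": 0.1}
best, worst = summarize(losses, "sweep/*")
```

The `main` ref has the lowest loss here but is excluded because it does not match the pattern.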
Operator templates (Layer 3 examples)
| Item | What | Effort |
|---|---|---|
| Cron template: ada-ivf-status + conditional ada-ivf-step | Schedule-once-a-day cron that checks imbalance and triggers maintenance if needed. | ~60 lines YAML |
| Cron template: dreamdb-cli gc --keep-since=7d daily | Standard 7-day-retention GC. | ~40 lines YAML |
| Prometheus exporter (shell) | Wraps ada-ivf-status and inspect output, emits Prom metrics. | ~80 lines shell |
| Argo Workflow: full rebuild + verify | Sharded rebuild + post-rebuild brute-force recall check. | ~150 lines YAML |
| Multi-region replication recipe | Use rclone or aws s3 sync between DreamDB buckets. Document because content-addressed Objects make this lossless. | Doc + ~30 line shell |
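
The post-rebuild brute-force recall check in the Argo Workflow item compares the index's answers against an exhaustive scan. A minimal sketch of that measurement follows, assuming Euclidean distance and hypothetical helper names; the metric DreamDB actually uses is not stated here.

```python
import math
from typing import List, Sequence


def brute_force_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k nearest vectors by exhaustive scan (Euclidean, assumed)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate result also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact.intersection(approx_ids)) / len(exact)


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
exact = brute_force_top_k([0.0, 0.0], vectors, k=2)
recall = recall_at_k([0, 2], exact)  # the index returned ids 0 and 2
```

A verify step would average this over a sample of queries and fail the workflow when the mean falls below a chosen threshold.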
Quality (cross-cutting)
| Item | What | Effort |
|---|---|---|
| Conformance test suite for chain-aware lineage | Add to dreamdb-conformance: write a bucket under SI_A, evolve to SI_B with parents=[A], prove the bucket reads correctly. | ~150 LOC |
| Performance regression suite | dreamdb-bench extensions: rebuild throughput, query latency, ingest throughput. Run nightly. | ~300 LOC |
| Documentation site | mkdocs-material or similar. Tutorials, API reference, the design docs as published pages. | ~1 week |
P1 total: ~5000 LOC + 4 spec amendments + 1 week docs. 6-8 weeks of focused effort.
P2: Future / nice-to-have
Things worth doing eventually but not on the critical path.
Spec
| Item | What |
|---|---|
| Federation (spec/0012 drafted; not implemented) | Cross-bucket queries. Each bucket is an independent DreamDB; queries can fan out across them. |
| Encryption (spec/0019 drafted) | At-rest encryption of payload Objects. Keys managed by the operator. |
| Hybrid retrieval (spec/0015 drafted) | Combine vector search with sparse / BM25 / structured-filter scoring. |
| Multi-tenant isolation (spec/0018 drafted) | Per-tenant subset filtering at the connector layer. |
| Graph-ANN query path | spec/0013 defines Vamana algorithm; serialization exists; query verb missing. |
SDK
| Item | What |
|---|---|
| dreamdb_dataset.jax.DreamDBDataset | JAX-equivalent of the PyTorch wrapper. |
| dreamdb_dataset.tf.DreamDBDataset | TensorFlow. |
| Browser query optimizations | Web Workers for ADC scoring (currently single-threaded JS). |
| Brotli-compressed manifest CBOR | Cheaper transfer for huge manifests. |
| Async-batched HEAD in GC | Already shipped via buffer_unordered(32); could grow to buffer_unordered(256) against S3. |
| HTTP/3 connector support | Lower latency on lossy networks. |
| Dataset::async_iter true-stream API | Currently iteration is async per-batch but eagerly buffered; expose a real backpressured stream. |
Apps
| Item | What |
|---|---|
| Real-time append demo | Live UI showing records appearing as they're appended. WebSocket bridge over the protocol. |
| ML annotation tool | UI for human-in-the-loop labeling, writing labels as scalar tracks. |
| Audit / lineage viewer | UI showing "this trained model used dataset snapshot X; X contained records from sources Y, Z; record at ordinal N was last modified by writer W". |
| Cost dashboard | "How much disk does each Ref/snapshot cost?" Using content-addressing dedup math. |
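
The "content-addressing dedup math" behind the cost dashboard can be sketched as two numbers per Ref: the total size of every Object it reaches, and the size of the Objects that only it reaches. The function name and input shapes below are hypothetical; a real implementation would derive the reachable sets by walking each Ref's Manifest.

```python
from collections import Counter
from typing import Dict, Set


def ref_costs(
    reachable: Dict[str, Set[str]], sizes: Dict[str, int]
) -> Dict[str, Dict[str, int]]:
    """Per-ref total bytes and exclusive bytes (objects no other ref reaches)."""
    # How many refs reach each object hash.
    owners = Counter(h for objects in reachable.values() for h in objects)
    return {
        ref: {
            "total": sum(sizes[h] for h in objects),
            "exclusive": sum(sizes[h] for h in objects if owners[h] == 1),
        }
        for ref, objects in reachable.items()
    }


costs = ref_costs(
    {"main": {"a", "b"}, "sweep/run-000": {"a", "c"}},
    {"a": 100, "b": 10, "c": 5},
)
```

The exclusive figure is what deleting a Ref would free after GC; shared Objects such as `a` cost nothing extra per additional branch.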
Quality
| Item | What |
|---|---|
| Conformance test suite extension to JS | Cross-language: same SI Object, JS query → Rust query → same results. |
| Fuzz testing of CBOR decoders | property-based round-trip + random-bytes-don't-crash. |
| cargo deny policy file | License + advisory audit. |
| 0-deps Rust kernel option | Strip optional features for embedded use cases. |
Suggested execution order
Concretely, the most impactful next 4-6 weeks:
- Week 1: P0 SDK foundation (Arrow bridge + snapshot + add_embedding_layer)
- Week 2: P0 PyTorch integration + compare_refs
- Week 3: P0 sweep_runner + sweep_dashboard examples
- Week 4: Memory/docs catch-up; user trial of the sweep workflow on a real dataset
- Week 5-6: P1 starts. Pick between tombstones (regulated use cases), schema migration (multi-version coexistence), and multi-parent merge (multi-team scenarios), depending on which user pain hits first.
The P0 batch unlocks DreamDB as a credible ML training data source. P1 makes it production-credible. P2 is the "vision" tier.
What today's day-of-work proved
Phases 1-3.4 (all shipped today, 2026-05-15) demonstrated that:
- The protocol is well-grounded: every architectural fix had a clean spec amendment path or fit within existing semantics. Chain-aware lineage required just one new field (parents); no other primitives needed to change.
- Scope boundaries are the key discipline: the most damaging thing we shipped (inline auto-rebuild) violated the protocol/operator boundary. Deleting it was the right call; the operator-driven CLI + cron pattern is the natural replacement.
- Mechanism-first lets apps be thin: the four layers (Protocol → SDK → Operator → App) compose cleanly. The sweep dashboard example app will be ~500 LOC because the SDK does the heavy lifting; without the SDK primitives it'd be ~5000 LOC.
The TODO list above is large but every item maps to a real user pain or a known production gap. Nothing on it is speculative.