DreamDB TODO Roadmap

Snapshot 2026-05-15 (original), refreshed 2026-05-18 after the 10B-scale push (B1–B8 all ✅). Companion to 0002-known-flaws-retrospective.md (what's wrong) and 0003-scope-boundaries.md (where things belong).

Status as of 2026-05-18: every P0 SDK + CLI item listed below has shipped. Every P1 item that was on the 10B-blocker critical path (B1 streaming iter, B2 sharded ingest / union-merge, B3 sharded redispatch, B4 prefix-sharded GC, B6 rayon hash_vector, B8 tombstones) has shipped. The roadmap below is preserved as a historical snapshot; the authoritative current TODO is design/0006-10b-scale-blockers.md ("Quality-of-life follow-ups" section) plus the README's "Honest gaps" list.

This was the consolidated TODO across all four layers (Protocol / SDK / Operator / App) plus cross-cutting work. Priorities:

P0 — blocks the canonical ML workflow (pull dataset → sweep → branch → compare). All ✅ shipped by 2026-05-18.
P1 — significant production value; needed at 100M+ scale or for regulated deployments. Most ✅ shipped via 10B push.
P2 — nice-to-have, future iteration.

Effort is a rough LOC estimate for code, or a rough number of weeks for cross-cutting work. "Spec change" = needs a numbered amendment to spec/*.md.

P0: Make the sweep workflow first-class

The user-visible goal: user pulls dataset → sweep processing/training → adds tracks on a branch → compares results. This list is the minimum delta from where we are today.

SDK (Layer 2)

Item	What	Effort	Depends on
`Dataset::add_embedding_layer`	Mirror of existing `add_scalar_layer`. Lets a branch add a new embedding modality (e.g. `embedding_bert_v2`) without re-ingesting source data. Publishes new SI + new Track + new Manifest.	~80 LOC	Phase 3.1 (chain-aware lineage) — done
`Dataset::snapshot(label) -> DatasetVersion`	Real impl of the existing stub. Creates a named Ref at current tip via `If-None-Match: *` PUT. Returns a `DatasetVersion` capturing the manifest hash + label.	~50 LOC	`Dataset::branch` — done
`Dataset::open_at(ref_or_snapshot)`	Sibling of `open()` that takes a snapshot label OR a manifest hash; pins reads to that exact state.	~30 LOC	snapshot
`Dataset::iter_arrow_batches(batch_size, fields, shuffle_seed)`	Stream RecordBatches. Walks selected modality Tracks, joins by record ordinal, emits typed Arrow columns. Embeddings as `FixedSizeList<f32>`; images as `Binary`; scalars typed.	~300 LOC + 1 new dep (`arrow-array`)	—
`dreamdb_dataset.torch.DreamDBDataset`	`IterableDataset` subclass wrapping `iter_arrow_batches`. Supports `num_workers > 0` via `worker_init_fn` partitioning by record-ordinal modulo.	~150 LOC Python	iter_arrow_batches
`Dataset::compare_refs(refs, fields) -> Arrow table`	Wide-form table: `record_id | field@ref_a | field@ref_b | …`
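
The `num_workers > 0` support described for `DreamDBDataset` comes down to each worker keeping only the record ordinals congruent to its own id. A minimal sketch in plain Python, assuming integer ordinals; `shard_for_worker` is a hypothetical helper for illustration, not an SDK function:

```python
from typing import Iterable, Iterator


def shard_for_worker(
    ordinals: Iterable[int], worker_id: int, num_workers: int
) -> Iterator[int]:
    """Yield only the record ordinals owned by `worker_id` (hypothetical helper)."""
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    for ordinal in ordinals:
        # Modulo partitioning: every ordinal belongs to exactly one worker.
        if ordinal % num_workers == worker_id:
            yield ordinal


# Three workers over ten records: the shards are disjoint and cover everything.
shards = [list(shard_for_worker(range(10), w, 3)) for w in range(3)]
```

In a PyTorch `IterableDataset`, `worker_id` and `num_workers` would typically come from `torch.utils.data.get_worker_info()` inside `__iter__`.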

CLI (Layer 3)

Item	What	Effort
`dreamdb-cli compare-refs --field <name> <ref-a> <ref-b> ...`	Same as the SDK verb but command-line. Prints summary stats (mean/median diff for scalars; mean cosine sim for embeddings) and optionally writes the full Arrow table to a file.	~100 LOC
`dreamdb-cli snapshot --ref <src> --label <label>`	One-shot wrapper for `Dataset::snapshot`.	~40 LOC
`dreamdb-cli inspect --ref <name>`	Walk the Manifest DAG; print snapshot history with stats (record count delta, track delta, writer, ts) per snapshot. Useful for "what changed in the last 10 commits".	~150 LOC
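
The summary stats that `compare-refs` prints (mean/median diff for scalars, mean cosine similarity for embeddings) can be sketched with the standard library alone. This assumes both refs expose the same records in the same order; the function names are illustrative, not the CLI's internals:

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence


def scalar_diff_summary(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Mean and median of the per-record difference b - a."""
    if len(a) != len(b):
        raise ValueError("both refs must cover the same records")
    diffs = [y - x for x, y in zip(a, b)]
    return {"mean_diff": mean(diffs), "median_diff": median(diffs)}


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(
    emb_a: List[Sequence[float]], emb_b: List[Sequence[float]]
) -> float:
    """Average per-record cosine similarity between two refs' embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("both refs must cover the same records")
    return mean([cosine_similarity(u, v) for u, v in zip(emb_a, emb_b)])
```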

Examples / App (Layer 4)

Item	What	Effort
`examples/sweep_runner.py`	End-to-end: open source, branch N times, run a fake training loop per branch, append results, exit. Use as a smoke test for the sweep workflow.	~250 LOC
`examples/sweep_dashboard.html`	Browser app: list all `sweep/*` refs, side-by-side comparison of scalar metrics, click-to-view individual records.	~500 LOC (HTML+JS)
`examples/training_recipe.py`	PyTorch fine-tuning loop using `dreamdb_dataset.torch.DreamDBDataset`. Snapshot before training, label snapshot after with model commit hash.	~150 LOC
`examples/active_learning_loop.py`	Demo: model.predict_uncertain → human labels → append → retrain on next snapshot.	~200 LOC

P0 total: ~2000 LOC across SDK/CLI/examples. Probably 2-3 weeks of focused effort.

P4 update (2026-05-15 evening): first real training run hit P1 priorities

A 100-class linear probe on imagenet-100's CLIP embeddings ran end-to-end (examples/linear_probe.py). Final result: val_acc 4.9% → 36.0% over 5 epochs, training set 226,689 records × 512 dim. Two real gaps were surfaced and fixed during the run:

Gap	Status	Fix
`iter_arrow_batches` didn't include embedding columns	✅ Fixed (P4.0)	Extended `iter_time_range` to walk SpatialBucket tracks + decode via `vc.decode`
Paged TrackObjects rejected in iter path	✅ Fixed (P4.0)	B-tree walk in `fetch_spatial_bucket_track_entries`
Eager fetch of 42K buckets serially → 486s load time	🟡 Partially fixed	Added `buffer_unordered(16)` parallel fetch (measuring speedup in v4 run)
Streaming iter (returns Stream instead of Vec)	⏳ Open	Real fix for the 1B-scale case; ~400 LOC
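
The bounded parallel fetch in the table above has a rough Python analogue. This is a sketch of the idea rather than the Rust code: at most `limit` fetches are in flight, and results are yielded in completion order instead of being collected first. `fetch_unordered` and `fake_fetch` are hypothetical names, and unlike `buffer_unordered` this version creates every task up front and only gates the I/O with a semaphore:

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable, Tuple


async def fetch_unordered(
    keys: Iterable[int],
    fetch: Callable[[int], Awaitable[bytes]],
    limit: int = 16,
) -> AsyncIterator[Tuple[int, bytes]]:
    """Yield (key, payload) pairs as fetches finish, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(key: int) -> Tuple[int, bytes]:
        async with semaphore:  # caps concurrent calls to `fetch`
            return key, await fetch(key)

    tasks = [asyncio.ensure_future(guarded(key)) for key in keys]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # streamed out in completion order


async def _demo() -> Tuple[list, int]:
    in_flight = 0
    peak = 0

    async def fake_fetch(key: int) -> bytes:
        # Stand-in for a bucket GET; tracks how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.001)
        in_flight -= 1
        return f"bucket-{key}".encode()

    pairs = [pair async for pair in fetch_unordered(range(50), fake_fetch, limit=4)]
    return pairs, peak


results, peak = asyncio.run(_demo())
```

Because the function is an async generator, a consumer can start processing the first buckets while later ones are still being fetched, which is the property the open streaming-iter item is after.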

Implication for P1 ordering: streaming iter is now the highest-leverage P1 item, and the other items are demoted below it. This ordering is evidence-driven, not speculative.

P1.0 new (from v4 probe 2026-05-15): the fields filter doesn't propagate into Rust. `iter_arrow_batches(fields=["embedding", "label"])` filters the Arrow columns only AFTER fetching; Rust's `iter_time_range` still fetches ALL blob fields, including images. For a probe that doesn't need images, this means hundreds of MB of unnecessary fetch. Fix: thread a `fields: Option<HashSet<String>>` parameter from Python through `Filter` and gate the `blob_fields` / `scalar_fields` / `embedding_fields` loops on it. ~50 LOC, high-leverage.
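
The shape of that fix can be shown in a few lines. The sketch below is Python, not the Rust change itself: `fetch_record_fields` and the `fetch` callback are hypothetical, and `fields=None` plays the role of `Option::None` (no filter, fetch everything):

```python
from typing import Callable, Dict, List, Optional, Set


def fetch_record_fields(
    field_groups: Dict[str, List[str]],
    fetch: Callable[[str], bytes],
    fields: Optional[Set[str]] = None,
) -> Dict[str, bytes]:
    """Fetch only the requested fields; the filter is applied BEFORE any I/O."""
    out: Dict[str, bytes] = {}
    for names in field_groups.values():
        for name in names:
            if fields is not None and name not in fields:
                continue  # gated out: no fetch is issued for this field
            out[name] = fetch(name)
    return out


# Illustrative grouping that mirrors the three loops named in the text.
GROUPS = {
    "blob_fields": ["image"],
    "scalar_fields": ["label"],
    "embedding_fields": ["embedding"],
}

fetched: List[str] = []


def recording_fetch(name: str) -> bytes:
    fetched.append(name)  # record which fields actually hit storage
    return name.encode()


probe = fetch_record_fields(GROUPS, recording_fetch, fields={"embedding", "label"})
```

After the call, `fetched` contains only `label` and `embedding`; the image blob is never requested.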

P1: Production-grade at scale

The work that's needed before DreamDB can replace a managed vector DB in a real deployment.

Spec (Layer 1)

Item	What	Effort	Why P1
Tombstones (`spec/0020`?)	Define `dreamdb.tombstones` registry entry shape. Per-modality list of `(track_position, anchor_hash)` pairs. Query path skips tombstoned records. GC eventually compacts.	Spec amendment + ~400 LOC	GDPR; correction of bad records; mandatory for production.
Schema evolution (`spec/0017` exists as draft; implement)	Define what a schema migration CAN change without re-ingest: adding optional fields ✅, dropping fields ✅, changing existing field types ❌. Manifest carries `schema_version: u32`.	Spec + ~200 LOC	Multi-version dataset coexistence.
Phase 3.4b: Incremental Track B-tree update	Currently `publish_spatial_bucket_track` rebuilds the whole B-tree. With chain-aware lineage, only leaves containing CHANGED entries need new content; only their ancestor pages need re-PUTting.	Spec amendment to `spec/0002 §7.3.2` clarifying chain-aware page reuse, plus ~300 LOC	At 1B-cell scale the full B-tree rewrite is the new dominant cost.
Multi-parent merge semantics (`spec/0008` extension)	Define what it means when a Manifest has `parents = [A, B]` and A's embedding modality has SI `X`, B's has SI `Y`. Conflict detection algorithm.	Spec amendment	Foundation for `MergeStrategy::RefuseOnSiConflict`.
Streaming freshness (`spec/0016` exists as draft; implement)	Records visible BEFORE a Manifest is published (committed-but-not-yet-published). Critical for low-latency append + query workflows.	Substantial spec + impl	Real-time use cases.
Address-scheme amendment	Move `spatial_key` off the bucket path; encode it in the Track entry only. Eliminates the re-PUT in the cold-bucket `spatial_key`-shift case from Phase 3.2.	Spec amendment (backwards-compat: dual-form during transition)	Saves ~1 HTTP PUT per cold-bucket-with-shift. Significant at large scale.
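
The schema-evolution rule in the table (adding optional fields and dropping fields are allowed; changing an existing field's type is not) can be expressed as a small validator. A sketch under assumptions: the `Field` shape and `validate_migration` are hypothetical, adding a required field is treated as a violation because the source only lists optional additions as allowed, and changes to optionality are not checked:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    type_name: str
    optional: bool = False


def validate_migration(old: Dict[str, Field], new: Dict[str, Field]) -> List[str]:
    """Return the list of violations; an empty list means no re-ingest is needed."""
    problems: List[str] = []
    for name, field in new.items():
        if name not in old:
            # Assumption: only OPTIONAL additions are safe without re-ingest.
            if not field.optional:
                problems.append(f"added field '{name}' must be optional")
        elif old[name].type_name != field.type_name:
            problems.append(
                f"field '{name}' changes type "
                f"{old[name].type_name} -> {field.type_name}"
            )
    # Fields present in `old` but missing from `new` are drops, which are allowed.
    return problems


old_schema = {"label": Field("i64"), "caption": Field("utf8")}
ok_schema = {"label": Field("i64"), "score": Field("f32", optional=True)}
bad_schema = {"label": Field("f32"), "caption": Field("utf8"), "weight": Field("f32")}
```

Here `validate_migration(old_schema, ok_schema)` returns an empty list, while `bad_schema` is rejected for both the type change on `label` and the required addition `weight`.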

SDK (Layer 2)

Item	What	Effort
`MergeStrategy::RefuseOnSiConflict` impl	Multi-parent Manifest construction. Uses chain-aware lineage (Phase 3.1) to find common ancestor SI. Conflict detection per spec/0008.	~250 LOC
`Dataset::delete(ordinals: Vec<u64>)`	Publish a Manifest with `dreamdb.tombstones` entry covering those ordinals. Subsequent reads skip them. GC eventually compacts.	~150 LOC + tombstones spec
`Dataset::update_schema(...)`	Schema migration verb. Validates the diff is forward-compatible; publishes a new Manifest with the updated schema in the registry; existing data untouched.	~150 LOC + schema-evolution spec
Phase 3.3: Sharded redispatch	True multi-pod redispatch. Workers handle their slice's replaced cells; emit partial bucket-entry lists; orchestrator stitches into a paged Track. Four-phase k8s Job pipeline.	~400 LOC + k8s YAML
Paged-track read in `Dataset::iter`	Currently `iter` likely doesn't walk paged tracks (need to verify). Add `flatten_paged_*` helpers similar to ada-ivf-step's.	~80 LOC
Local working-copy cache	Optional on-disk cache for SDK reads. Lets users "clone" a snapshot for offline access. Useful for laptop-based training on cloud-backed datasets.	~400 LOC
HNSW algorithm	Alongside IVF and LSH. Better recall at moderate scale. Vamana algorithm + serialization already in protocol; needs query path + index-build.	~600 LOC
IVF-PQ algorithm	Better compression-recall trade-off than RaBitQ. Faiss's classic combination.	~400 LOC
Parallel `hash_vector` via rayon	The IVF dispatch loop runs serially per query. Parallelize for queries that fan out to many cells.	~30 LOC
`Dataset::time_anchor_iter(since, until)`	Iterator that yields records whose time_anchor falls in a window. Useful for "give me data from the last 24h" without scanning the whole dataset.	~80 LOC
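
For `Dataset::delete`, the read-side behaviour ("subsequent reads skip them") amounts to filtering the iteration against the tombstone list. A minimal sketch, assuming tombstones are the `(track_position, anchor_hash)` pairs from the spec item and that a tombstone only applies when both parts match; `iter_live` and the record tuple shape are hypothetical:

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str, str]  # (track_position, anchor_hash, payload)


def iter_live(
    records: Iterable[Record], tombstones: Iterable[Tuple[int, str]]
) -> Iterator[Tuple[int, str]]:
    """Yield (track_position, payload) for every record that is not tombstoned."""
    dead = set(tombstones)  # O(1) membership test per record
    for position, anchor_hash, payload in records:
        if (position, anchor_hash) in dead:
            continue  # deleted: skipped on read, still on disk until GC compacts
        yield position, payload


records = [(0, "h0", "cat"), (1, "h1", "dog"), (2, "h2", "bird")]
# The second tombstone carries a stale anchor hash, so it matches nothing.
live = list(iter_live(records, [(1, "h1"), (2, "stale")]))
```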

CLI (Layer 3)

Item	What	Effort
`dreamdb-cli rebuild-ivf` paged-track support	Currently `rebuild-ivf` writes inline; needs `publish_spatial_bucket_track` integration.	~30 LOC
`dreamdb-cli diff <ref-a> <ref-b>`	Set-diff of records: present in B but not A, and vice versa.	~120 LOC
`dreamdb-cli sweep-init --source <ref> --runs <n>`	Helper: create N branches from a source ref with naming convention `sweep/run-NNN`.	~80 LOC
`dreamdb-cli sweep-summarize --pattern 'sweep/*' --metric loss`	Tabulate scalar metrics across matching refs; pick best/worst.	~150 LOC
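
The two sweep helpers are mostly naming and tabulation. The sketch below assumes zero-based, three-digit run numbers for `sweep/run-NNN` and an in-memory dict of metrics per ref; both function names and the `lower_is_better` flag are hypothetical, not CLI options:

```python
from fnmatch import fnmatch
from typing import Dict, List, Tuple


def sweep_ref_names(runs: int, prefix: str = "sweep/run-") -> List[str]:
    """Branch names following the `sweep/run-NNN` convention (zero-based here)."""
    return [f"{prefix}{i:03d}" for i in range(runs)]


def summarize(
    metrics_by_ref: Dict[str, Dict[str, float]],
    pattern: str,
    metric: str,
    lower_is_better: bool = True,
) -> Dict[str, object]:
    """Tabulate one scalar metric across matching refs and pick best/worst."""
    rows: List[Tuple[str, float]] = sorted(
        (
            (ref, values[metric])
            for ref, values in metrics_by_ref.items()
            if fnmatch(ref, pattern) and metric in values
        ),
        key=lambda row: row[1],
        reverse=not lower_is_better,
    )
    if not rows:
        raise ValueError(f"no refs matching {pattern!r} carry metric {metric!r}")
    return {"table": rows, "best": rows[0][0], "worst": rows[-1][0]}


names = sweep_ref_names(3)
summary = summarize(
    {
        "sweep/run-000": {"loss": 0.42},
        "sweep/run-001": {"loss": 0.31},
        "sweep/run-002": {"loss": 0.57},
        "main": {"loss": 0.10},  # ignored: does not match the pattern
    },
    pattern="sweep/*",
    metric="loss",
)
```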

Operator templates (Layer 3 examples)

Item	What	Effort
Cron template: `ada-ivf-status` + conditional `ada-ivf-step`	Schedule-once-a-day cron that checks imbalance and triggers maintenance if needed.	~60 lines YAML
Cron template: `dreamdb-cli gc --keep-since=7d` daily	Standard 7-day-retention GC.	~40 lines YAML
Prometheus exporter (shell)	Wraps `ada-ivf-status` and `inspect` output, emits Prom metrics.	~80 lines shell
Argo Workflow: full rebuild + verify	Sharded rebuild + post-rebuild brute-force recall check.	~150 lines YAML
Multi-region replication recipe	Use rclone or aws s3 sync between DreamDB buckets. Worth documenting because content-addressed Objects make this replication lossless.	Doc + ~30 line shell

Quality (cross-cutting)

Item	What	Effort
Conformance test suite for chain-aware lineage	Add to `dreamdb-conformance`: write a bucket under SI_A, evolve to SI_B with `parents=[A]`, prove the bucket reads correctly.	~150 LOC
Performance regression suite	`dreamdb-bench` extensions: rebuild throughput, query latency, ingest throughput. Run nightly.	~300 LOC
Documentation site	mkdocs-material or similar. Tutorials, API reference, the design docs as published pages.	~1 week

P1 total: ~5000 LOC + 4 spec amendments + 1 week docs. 6-8 weeks of focused effort.

P2: Future / nice-to-have

Things worth doing eventually but not on the critical path.

Spec

Item	What
Federation (`spec/0012` drafted; not implemented)	Cross-bucket queries. Each bucket is an independent DreamDB; queries can fan out across them.
Encryption (`spec/0019` drafted)	At-rest encryption of payload Objects. Keys managed by the operator.
Hybrid retrieval (`spec/0015` drafted)	Combine vector search with sparse / BM25 / structured-filter scoring.
Multi-tenant isolation (`spec/0018` drafted)	Per-tenant subset filtering at the connector layer.
Graph-ANN query path	`spec/0013` defines Vamana algorithm; serialization exists; query verb missing.

SDK

Item	What
`dreamdb_dataset.jax.DreamDBDataset`	JAX-equivalent of the PyTorch wrapper.
`dreamdb_dataset.tf.DreamDBDataset`	TensorFlow.
Browser query optimizations	Web Workers for ADC scoring (currently single-threaded JS).
Brotli-compressed manifest CBOR	Cheaper transfer for huge manifests.
Async-batched HEAD in GC	Already shipped via `buffer_unordered(32)`; could grow to `buffer_unordered(256)` against S3.
HTTP/3 connector support	Lower latency on lossy networks.
`Dataset::async_iter` true-stream API	Currently iteration is async per-batch but eagerly buffered; expose a real backpressured stream.

Apps

Item	What
Real-time append demo	Live UI showing records appearing as they're appended. WebSocket bridge over the protocol.
ML annotation tool	UI for human-in-the-loop labeling, writing labels as scalar tracks.
Audit / lineage viewer	UI showing "this trained model used dataset snapshot X; X contained records from sources Y, Z; record at ordinal N was last modified by writer W".
Cost dashboard	"How much disk does each Ref/snapshot cost?" Using content-addressing dedup math.
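
The "content-addressing dedup math" behind the cost dashboard can be illustrated as follows. This is a sketch under assumptions: each ref is represented as a map from Object hash to size in bytes for everything reachable from it, and `storage_report` is a hypothetical helper. An Object shared by several refs is stored once, so a ref's marginal cost is only the bytes no other ref reaches:

```python
from typing import Dict


def storage_report(refs: Dict[str, Dict[str, int]]) -> Dict[str, object]:
    """Compute deduplicated total bytes and per-ref exclusive bytes."""
    reach_count: Dict[str, int] = {}  # object hash -> number of refs reaching it
    sizes: Dict[str, int] = {}        # object hash -> size in bytes
    for objects in refs.values():
        for digest, size in objects.items():
            reach_count[digest] = reach_count.get(digest, 0) + 1
            sizes[digest] = size
    exclusive = {
        name: sum(size for digest, size in objects.items() if reach_count[digest] == 1)
        for name, objects in refs.items()
    }
    return {
        "total_bytes": sum(sizes.values()),  # what the bucket actually stores
        "naive_bytes": sum(sum(o.values()) for o in refs.values()),  # without dedup
        "exclusive_bytes": exclusive,        # freed if only that ref were dropped
    }


report = storage_report(
    {
        "main": {"a": 100, "b": 50},
        "sweep/run-000": {"a": 100, "c": 10},  # shares Object "a" with main
    }
)
```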

Quality

Item	What
Conformance test suite extension to JS	Cross-language: same SI Object, JS query → Rust query → same results.
Fuzz testing of CBOR decoders	Property-based round-trip tests plus random-bytes-don't-crash checks.
`cargo deny` policy file	License + advisory audit.
0-deps Rust kernel option	Strip optional features for embedded use cases.

Suggested execution order

Concretely, the most impactful next 4-6 weeks:

Week 1: P0 SDK foundation (Arrow bridge + snapshot + add_embedding_layer)
Week 2: P0 PyTorch integration + compare_refs
Week 3: P0 sweep_runner + sweep_dashboard examples
Week 4: Memory/docs catch-up; user trial of the sweep workflow on a real dataset
Week 5-6: P1 starts. Pick between tombstones (regulated use cases) vs schema migration (multi-version coexistence) vs multi-parent merge (multi-team scenarios) depending on which user pain hits first.

The P0 batch unlocks DreamDB as a credible ML training data source. P1 makes it production-credible. P2 is the "vision" tier.

What today's day-of-work proved

Phases 1-3.4 (all shipped today, 2026-05-15) demonstrated that:

The protocol is well-grounded — every architectural fix had a clean spec amendment path or fit within existing semantics. Chain-aware lineage required just one new field (parents); no other primitives needed to change.
Scope boundaries are the key discipline — the most damaging thing we shipped (inline auto-rebuild) violated the protocol/operator boundary. Deleting it was the right call; the operator-driven CLI + cron pattern is the natural replacement.
Mechanism-first lets apps be thin — the four layers (Protocol → SDK → Operator → App) compose cleanly. The sweep dashboard example app will be ~500 LOC because the SDK does the heavy lifting; without the SDK primitives it'd be ~5000 LOC.

The TODO list above is large but every item maps to a real user pain or a known production gap. Nothing on it is speculative.