DreamDBv0.2.0bec026

DreamDB TODO Roadmap

Snapshot 2026-05-15 (original), refreshed 2026-05-18 after the 10B-scale push (B1–B8 all ✅). Companion to 0002-known-flaws-retrospective.md (what's wrong) and 0003-scope-boundaries.md (where things belong).

Status as of 2026-05-18: every P0 SDK + CLI item listed below has shipped. Every P1 item that was on the 10B-blocker critical path (B1 streaming iter, B2 sharded ingest / union-merge, B3 sharded redispatch, B4 prefix-sharded GC, B6 rayon hash_vector, B8 tombstones) has shipped. The roadmap below is preserved as a historical snapshot; the authoritative current TODO is design/0006-10b-scale-blockers.md ("Quality-of-life follow-ups" section) plus the README's "Honest gaps" list.

This was the consolidated TODO across all four layers (Protocol / SDK / Operator / App) plus cross-cutting work. Priorities:

  • P0 — blocks the canonical ML workflow (pull dataset → sweep → branch → compare). All ✅ shipped by 2026-05-18.
  • P1 — significant production value; needed at 100M+ scale or for regulated deployments. Most ✅ shipped via 10B push.
  • P2 — nice-to-have, future iteration.

Effort is rough LOC for code or roughly weeks for cross-cutting work. "Spec change" = needs a numbered amendment to spec/*.md.


P0: Make the sweep workflow first-class

The user-visible goal: user pulls dataset → sweep processing/training → adds tracks on a branch → compares results. This list is the minimum delta from where we are today.

SDK (Layer 2)

| Item | What | Effort | Depends on |
| --- | --- | --- | --- |
| `Dataset::add_embedding_layer` | Mirror of existing add_scalar_layer. Lets a branch add a new embedding modality (e.g. embedding_bert_v2) without re-ingesting source data. Publishes new SI + new Track + new Manifest. | ~80 LOC | Phase 3.1 (chain-aware lineage), done |
| `Dataset::snapshot(label) -> DatasetVersion` | Real impl of the existing stub. Creates a named Ref at current tip via `If-None-Match: *` PUT. Returns a DatasetVersion capturing the manifest hash + label. | ~50 LOC | Dataset::branch, done |
| `Dataset::open_at(ref_or_snapshot)` | Sibling of open() that takes a snapshot label OR a manifest hash; pins reads to that exact state. | ~30 LOC | snapshot |
| `Dataset::iter_arrow_batches(batch_size, fields, shuffle_seed)` | Stream RecordBatches. Walks selected modality Tracks, joins by record ordinal, emits typed Arrow columns. Embeddings as `FixedSizeList<f32>`; images as Binary; scalars typed. | ~300 LOC + 1 new dep (arrow-array) | |
| `dreamdb_dataset.torch.DreamDBDataset` | IterableDataset subclass wrapping iter_arrow_batches. Supports num_workers > 0 via worker_init_fn partitioning by record-ordinal modulo. | ~150 LOC Python | iter_arrow_batches |
| `Dataset::compare_refs(refs, fields) -> Arrow table` | Wide-form table: record_id, field@ref_a, field@ref_b | | |
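
The worker partitioning described for dreamdb_dataset.torch.DreamDBDataset (record-ordinal modulo) can be sketched in plain Python. This is a minimal illustration, not the shipped implementation: partition_ordinals is a hypothetical helper name, and the real wrapper would presumably read the worker id and worker count from torch.utils.data.get_worker_info() rather than take them as arguments.

```python
from typing import Iterable, Iterator


def partition_ordinals(
    ordinals: Iterable[int], worker_id: int, num_workers: int
) -> Iterator[int]:
    """Yield only the ordinals owned by this worker.

    Hypothetical helper: a record belongs to worker `w` when
    `ordinal % num_workers == w`, so shards are disjoint and cover every record.
    """
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("need num_workers >= 1 and 0 <= worker_id < num_workers")
    for ordinal in ordinals:
        if ordinal % num_workers == worker_id:
            yield ordinal


if __name__ == "__main__":
    # Three workers splitting ten records.
    for worker in range(3):
        print(worker, list(partition_ordinals(range(10), worker, 3)))
```

Modulo partitioning needs no coordination between workers and keeps shard sizes within one record of each other. The trade-off is that every worker still walks the full ordinal sequence and discards what it does not own.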

CLI (Layer 3)

| Item | What | Effort |
| --- | --- | --- |
| `dreamdb-cli compare-refs --field <name> <ref-a> <ref-b> ...` | Same as the SDK verb but command-line. Prints summary stats (mean/median diff for scalars; mean cosine sim for embeddings) and optionally writes the full Arrow table to a file. | ~100 LOC |
| `dreamdb-cli snapshot --ref <src> --label <label>` | One-shot wrapper for Dataset::snapshot. | ~40 LOC |
| `dreamdb-cli inspect --ref <name>` | Walk the Manifest DAG; print snapshot history with stats (record count delta, track delta, writer, ts) per snapshot. Useful for "what changed in the last 10 commits". | ~150 LOC |
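
The summary stats that compare-refs prints (mean/median diff for scalars, mean cosine similarity for embeddings) are simple to state precisely. The sketch below uses only the standard library and assumes the two refs' values are already aligned by record; the function names are illustrative and are not the CLI's internals.

```python
import math
from statistics import mean, median
from typing import Dict, Sequence


def scalar_diff_summary(a: Sequence[float], b: Sequence[float]) -> Dict[str, float]:
    """Mean and median of (b - a), with values aligned record by record."""
    if len(a) != len(b):
        raise ValueError("refs must be aligned to the same records")
    diffs = [y - x for x, y in zip(a, b)]
    return {"mean_diff": mean(diffs), "median_diff": median(diffs)}


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cosine_similarity(
    emb_a: Sequence[Sequence[float]], emb_b: Sequence[Sequence[float]]
) -> float:
    """Average per-record cosine similarity between two refs' embeddings."""
    if len(emb_a) != len(emb_b):
        raise ValueError("refs must be aligned to the same records")
    return mean([cosine_similarity(u, v) for u, v in zip(emb_a, emb_b)])
```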

Examples / App (Layer 4)

| Item | What | Effort |
| --- | --- | --- |
| `examples/sweep_runner.py` | End-to-end: open source, branch N times, run a fake training loop per branch, append results, exit. Use as a smoke test for the sweep workflow. | ~250 LOC |
| `examples/sweep_dashboard.html` | Browser app: list all `sweep/*` refs, side-by-side comparison of scalar metrics, click-to-view individual records. | ~500 LOC (HTML+JS) |
| `examples/training_recipe.py` | PyTorch fine-tuning loop using dreamdb_dataset.torch.DreamDBDataset. Snapshot before training, label snapshot after with model commit hash. | ~150 LOC |
| `examples/active_learning_loop.py` | Demo: model.predict_uncertain → human labels → append → retrain on next snapshot. | ~200 LOC |

P0 total: ~2000 LOC across SDK/CLI/examples. Probably 2-3 weeks of focused effort.


P4 update (2026-05-15 evening): first real training run hit P1 priorities

A 100-class linear probe on imagenet-100's CLIP embeddings ran end-to-end (examples/linear_probe.py). Final result: val_acc 4.9% → 36.0% over 5 epochs, training set 226,689 records × 512 dim. Two real gaps were surfaced and fixed during the run:

| Gap | Status | Fix |
| --- | --- | --- |
| iter_arrow_batches didn't include embedding columns | ✅ Fixed (P4.0) | Extended iter_time_range to walk SpatialBucket tracks + decode via vc.decode |
| Paged TrackObjects rejected in iter path | ✅ Fixed (P4.0) | B-tree walk in fetch_spatial_bucket_track_entries |
| Eager fetch of 42K buckets serially → 486s load time | 🟡 Partially fixed | Added buffer_unordered(16) parallel fetch (measuring speedup in v4 run) |
| Streaming iter (returns Stream instead of Vec) | ⏳ Open | Real fix for the 1B-scale case; ~400 LOC |

Implication for P1 ordering: streaming iter is now the highest-leverage P1 item, and the other items are demoted below it. This ordering is evidence-driven, not speculative.

P1.0 new (from v4 probe 2026-05-15): fields filter doesn't propagate into Rust. iter_arrow_batches(fields=["embedding", "label"]) filters the Arrow columns AFTER fetching, but Rust's iter_time_range still fetches ALL blob fields including images. For a probe that doesn't need images, this means hundreds of MB of unnecessary fetch. Fix: thread a fields: Option<HashSet<String>> parameter from Python through Filter and gate the blob_fields / scalar_fields / embedding_fields loops on it. ~50 LOC, high-leverage.
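
The gating logic of that fix can be shown in a few lines. The sketch below is Python rather than the Rust in which the fix would actually land, and the dictionary shape and the select_fields name are assumptions made for illustration. The semantics mirror an optional set: None means "no filter, fetch everything", while a set restricts each per-kind loop before any fetch is issued.

```python
from typing import Dict, Iterable, List, Optional, Set


def select_fields(
    available: Dict[str, List[str]], fields: Optional[Iterable[str]]
) -> Dict[str, List[str]]:
    """Decide which fields each per-kind loop should fetch.

    `available` maps a field kind (blob_fields, scalar_fields,
    embedding_fields) to the field names of that kind. `fields=None`
    keeps everything; otherwise only the requested names survive.
    """
    wanted: Optional[Set[str]] = None if fields is None else set(fields)
    return {
        kind: [name for name in names if wanted is None or name in wanted]
        for kind, names in available.items()
    }


if __name__ == "__main__":
    schema = {
        "blob_fields": ["image"],
        "scalar_fields": ["label"],
        "embedding_fields": ["embedding"],
    }
    # The probe asks for embedding + label only, so no image blobs are fetched.
    print(select_fields(schema, ["embedding", "label"]))
```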

P1: Production-grade at scale

The work that's needed before DreamDB can replace a managed vector DB in a real deployment.

Spec (Layer 1)

| Item | What | Effort | Why P1 |
| --- | --- | --- | --- |
| Tombstones (spec/0020?) | Define dreamdb.tombstones registry entry shape. Per-modality list of (track_position, anchor_hash) pairs. Query path skips tombstoned records. GC eventually compacts. | Spec amendment + ~400 LOC | GDPR; correction of bad records; mandatory for production. |
| Schema evolution (spec/0017 exists as draft; implement) | Define what a schema migration CAN change without re-ingest: adding optional fields ✅, dropping fields ✅, changing existing field types ❌. Manifest carries schema_version: u32. | Spec + ~200 LOC | Multi-version dataset coexistence. |
| Phase 3.4b: Incremental Track B-tree update | Currently publish_spatial_bucket_track rebuilds the whole B-tree. With chain-aware lineage, only leaves containing CHANGED entries need new content; only their ancestor pages need re-PUTting. | Spec amendment to spec/0002 §7.3.2 clarifying chain-aware page reuse, plus ~300 LOC | At 1B-cell scale the full B-tree rewrite is the new dominant cost. |
| Multi-parent merge semantics (spec/0008 extension) | Define what it means when a Manifest has parents = [A, B] and A's embedding modality has SI X, B's has SI Y. Conflict detection algorithm. | Spec amendment | Foundation for MergeStrategy::RefuseOnSiConflict. |
| Streaming freshness (spec/0016 exists as draft; implement) | Records visible BEFORE a Manifest is published (committed-but-not-yet-published). Critical for low-latency append + query workflows. | Substantial spec + impl | Real-time use cases. |
| Address-scheme amendment | Move spatial_key off the bucket path; encode in Track entry only. Eliminates the re-PUT in cold-bucket spatial_key shift case from Phase 3.2. | Spec amendment (backwards-compat: dual-form during transition) | Saves ~1 HTTP PUT per cold-bucket-with-shift. Significant at large scale. |
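
The schema-evolution rule stated above (adding optional fields is allowed, dropping fields is allowed, changing an existing field's type is not) translates directly into a validation pass. The sketch below is one possible reading, not the spec/0017 algorithm: the Field type and migration_violations name are hypothetical, and rejecting a newly added required field is an assumption, since the rule only speaks of adding optional fields.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical schema field: a type name plus an optional flag."""

    type_name: str
    optional: bool = False


def migration_violations(old: Dict[str, Field], new: Dict[str, Field]) -> List[str]:
    """Return the reasons a migration from `old` to `new` needs a re-ingest.

    An empty list means the migration is allowed under the stated rules.
    Fields present in `old` but missing from `new` are drops, which are allowed.
    """
    problems: List[str] = []
    for name, field in new.items():
        previous = old.get(name)
        if previous is None:
            # Assumption: a new required field cannot be satisfied by old records.
            if not field.optional:
                problems.append(f"added field '{name}' must be optional")
        elif previous.type_name != field.type_name:
            problems.append(
                f"field '{name}' changes type "
                f"{previous.type_name} -> {field.type_name}"
            )
    return problems
```

Returning a list of reasons rather than a boolean lets a verb such as update_schema report every violation in one pass.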

SDK (Layer 2)

| Item | What | Effort |
| --- | --- | --- |
| `MergeStrategy::RefuseOnSiConflict` impl | Multi-parent Manifest construction. Uses chain-aware lineage (Phase 3.1) to find common ancestor SI. Conflict detection per spec/0008. | ~250 LOC |
| `Dataset::delete(ordinals: Vec<u64>)` | Publish a Manifest with dreamdb.tombstones entry covering those ordinals. Subsequent reads skip them. GC eventually compacts. | ~150 LOC + tombstones spec |
| `Dataset::update_schema(...)` | Schema migration verb. Validates the diff is forward-compatible; publishes a new Manifest with the updated schema in the registry; existing data untouched. | ~150 LOC + schema-evolution spec |
| Phase 3.3: Sharded redispatch | True multi-pod redispatch. Workers handle their slice's replaced cells; emit partial bucket-entry lists; orchestrator stitches into a paged Track. Four-phase k8s Job pipeline. | ~400 LOC + k8s YAML |
| Paged-track read in `Dataset::iter` | Currently iter likely doesn't walk paged tracks (need to verify). Add `flatten_paged_*` helpers similar to ada-ivf-step's. | ~80 LOC |
| Local working-copy cache | Optional on-disk cache for SDK reads. Lets users "clone" a snapshot for offline access. Useful for laptop-based training on cloud-backed datasets. | ~400 LOC |
| HNSW algorithm | Alongside IVF and LSH. Better recall at moderate scale. Vamana algorithm + serialization already in protocol; needs query path + index-build. | ~600 LOC |
| IVF-PQ algorithm | Better compression-recall trade-off than RaBitQ. Faiss's classic combination. | ~400 LOC |
| Parallel hash_vector via rayon | The IVF dispatch loop runs serially per query. Parallelize for queries that fan out to many cells. | ~30 LOC |
| `Dataset::time_anchor_iter(since, until)` | Iterator that yields records whose time_anchor falls in a window. Useful for "give me data from the last 24h" without scanning the whole dataset. | ~80 LOC |
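
Both the tombstones spec item and Dataset::delete rely on the read path skipping tombstoned records. A minimal sketch of that filter follows, using the (track_position, anchor_hash) pair shape from the spec item. The live_entries name is hypothetical, and matching on the full pair (so a tombstone whose hash no longer matches the entry at that position suppresses nothing) is an assumption about the intended semantics.

```python
from typing import Iterable, Iterator, Set, Tuple

# (track_position, anchor_hash), the pair shape named in the tombstones item.
Entry = Tuple[int, str]


def live_entries(entries: Iterable[Entry], tombstones: Iterable[Entry]) -> Iterator[Entry]:
    """Yield Track entries that are not tombstoned, preserving track order.

    Assumption: a tombstone applies only when both the position and the
    anchor hash match.
    """
    dead: Set[Entry] = set(tombstones)
    for entry in entries:
        if entry not in dead:
            yield entry


if __name__ == "__main__":
    track = [(0, "h0"), (1, "h1"), (2, "h2")]
    print(list(live_entries(track, [(1, "h1")])))
```

Filtering at read time keeps deletion cheap (one small registry entry); the space is reclaimed only when GC compacts.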

CLI (Layer 3)

| Item | What | Effort |
| --- | --- | --- |
| `dreamdb-cli rebuild-ivf` paged-track support | Currently rebuild-ivf writes inline; needs publish_spatial_bucket_track integration. | ~30 LOC |
| `dreamdb-cli diff <ref-a> <ref-b>` | Set-diff of records: present in B but not A, and vice versa. | ~120 LOC |
| `dreamdb-cli sweep-init --source <ref> --runs <n>` | Helper: create N branches from a source ref with naming convention sweep/run-NNN. | ~80 LOC |
| `dreamdb-cli sweep-summarize --pattern 'sweep/*' --metric loss` | Tabulate scalar metrics across matching refs; pick best/worst. | ~150 LOC |
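
The two sweep helpers reduce to name generation plus a filtered ranking. The sketch below illustrates that logic with the standard library only. It makes several assumptions not stated in the roadmap: run numbers are zero-based and three digits wide, the pattern is a shell-style glob, one scalar per ref has already been read, and lower values are better by default (as for loss). Function names are illustrative.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple


def sweep_branch_names(runs: int, prefix: str = "sweep/run-") -> List[str]:
    """Generate branch names following the sweep/run-NNN convention.

    Assumption: numbering starts at 000 and is zero-padded to three digits.
    """
    return [f"{prefix}{index:03d}" for index in range(runs)]


def summarize(
    metric_by_ref: Dict[str, float], pattern: str, lower_is_better: bool = True
) -> Dict[str, object]:
    """Rank refs matching `pattern` by a scalar metric and pick best/worst."""
    matched: List[Tuple[str, float]] = [
        (ref, value)
        for ref, value in metric_by_ref.items()
        if fnmatchcase(ref, pattern)
    ]
    if not matched:
        raise ValueError(f"no refs match pattern {pattern!r}")
    ranked = sorted(matched, key=lambda item: item[1], reverse=not lower_is_better)
    return {"best": ranked[0], "worst": ranked[-1], "count": len(ranked)}


if __name__ == "__main__":
    print(sweep_branch_names(3))
    losses = {"sweep/run-000": 0.9, "sweep/run-001": 0.4, "main": 0.1}
    print(summarize(losses, "sweep/*"))
```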

Operator templates (Layer 3 examples)

| Item | What | Effort |
| --- | --- | --- |
| Cron template: `ada-ivf-status` + conditional `ada-ivf-step` | Schedule-once-a-day cron that checks imbalance and triggers maintenance if needed. | ~60 lines YAML |
| Cron template: `dreamdb-cli gc --keep-since=7d` daily | Standard 7-day-retention GC. | ~40 lines YAML |
| Prometheus exporter (shell) | Wraps ada-ivf-status and inspect output, emits Prom metrics. | ~80 lines shell |
| Argo Workflow: full rebuild + verify | Sharded rebuild + post-rebuild brute-force recall check. | ~150 lines YAML |
| Multi-region replication recipe | Use rclone or aws s3 sync between DreamDB buckets. Document because content-addressed Objects make this lossless. | Doc + ~30 line shell |

Quality (cross-cutting)

| Item | What | Effort |
| --- | --- | --- |
| Conformance test suite for chain-aware lineage | Add to dreamdb-conformance: write a bucket under SI_A, evolve to SI_B with parents=[A], prove the bucket reads correctly. | ~150 LOC |
| Performance regression suite | dreamdb-bench extensions: rebuild throughput, query latency, ingest throughput. Run nightly. | ~300 LOC |
| Documentation site | mkdocs-material or similar. Tutorials, API reference, the design docs as published pages. | ~1 week |

P1 total: ~5000 LOC + 4 spec amendments + 1 week docs. 6-8 weeks of focused effort.


P2: Future / nice-to-have

Things worth doing eventually but not on the critical path.

Spec

| Item | What |
| --- | --- |
| Federation (spec/0012 drafted; not implemented) | Cross-bucket queries. Each bucket is an independent DreamDB; queries can fan out across them. |
| Encryption (spec/0019 drafted) | At-rest encryption of payload Objects. Keys managed by the operator. |
| Hybrid retrieval (spec/0015 drafted) | Combine vector search with sparse / BM25 / structured-filter scoring. |
| Multi-tenant isolation (spec/0018 drafted) | Per-tenant subset filtering at the connector layer. |
| Graph-ANN query path | spec/0013 defines Vamana algorithm; serialization exists; query verb missing. |

SDK

| Item | What |
| --- | --- |
| `dreamdb_dataset.jax.DreamDBDataset` | JAX-equivalent of the PyTorch wrapper. |
| `dreamdb_dataset.tf.DreamDBDataset` | TensorFlow. |
| Browser query optimizations | Web Workers for ADC scoring (currently single-threaded JS). |
| Brotli-compressed manifest CBOR | Cheaper transfer for huge manifests. |
| Async-batched HEAD in GC | Already shipped via buffer_unordered(32); could grow to buffer_unordered(256) against S3. |
| HTTP/3 connector support | Lower latency on lossy networks. |
| `Dataset::async_iter` true-stream API | Currently iteration is async per-batch but eagerly buffered; expose a real backpressured stream. |

Apps

| Item | What |
| --- | --- |
| Real-time append demo | Live UI showing records appearing as they're appended. WebSocket bridge over the protocol. |
| ML annotation tool | UI for human-in-the-loop labeling, writing labels as scalar tracks. |
| Audit / lineage viewer | UI showing "this trained model used dataset snapshot X; X contained records from sources Y, Z; record at ordinal N was last modified by writer W". |
| Cost dashboard | "How much disk does each Ref/snapshot cost?" Using content-addressing dedup math. |
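
The "dedup math" behind the cost dashboard can be made concrete. Because Objects are content-addressed, a Ref's footprint can be split into bytes it can reach and bytes only it references. The sketch below is one way to compute that split, assuming the set of reachable Object hashes per Ref and the size of each Object are already known. ref_costs and both input shapes are hypothetical, not part of any DreamDB API.

```python
from collections import Counter
from typing import Dict, Iterable


def ref_costs(
    objects_by_ref: Dict[str, Iterable[str]], size_by_hash: Dict[str, int]
) -> Dict[str, Dict[str, int]]:
    """Compute per-Ref storage cost under content-addressed dedup.

    reachable_bytes: total size of the distinct Objects the Ref can reach.
    exclusive_bytes: size of Objects no other Ref reaches, i.e. what
    deleting this Ref alone could eventually free.
    """
    distinct = {ref: set(hashes) for ref, hashes in objects_by_ref.items()}
    # Number of Refs that reach each Object hash.
    ref_count = Counter(h for hashes in distinct.values() for h in hashes)
    return {
        ref: {
            "reachable_bytes": sum(size_by_hash[h] for h in hashes),
            "exclusive_bytes": sum(
                size_by_hash[h] for h in hashes if ref_count[h] == 1
            ),
        }
        for ref, hashes in distinct.items()
    }


if __name__ == "__main__":
    refs = {"main": ["a", "b"], "sweep/run-000": ["a", "b", "c"]}
    sizes = {"a": 100, "b": 50, "c": 7}
    print(ref_costs(refs, sizes))
```

In this example the sweep branch reaches 157 bytes but costs only 7 bytes on its own, since everything else is shared with main.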

Quality

| Item | What |
| --- | --- |
| Conformance test suite extension to JS | Cross-language: same SI Object, JS query → Rust query → same results. |
| Fuzz testing of CBOR decoders | Property-based round-trip + random-bytes-don't-crash. |
| `cargo deny` policy file | License + advisory audit. |
| 0-deps Rust kernel option | Strip optional features for embedded use cases. |

Suggested execution order

Concretely, the most impactful next 4-6 weeks:

  • Week 1: P0 SDK foundation (Arrow bridge + snapshot + add_embedding_layer)
  • Week 2: P0 PyTorch integration + compare_refs
  • Week 3: P0 sweep_runner + sweep_dashboard examples
  • Week 4: Memory/docs catch-up; user trial of the sweep workflow on a real dataset
  • Week 5-6: P1 starts. Pick between tombstones (regulated use cases), schema migration (multi-version coexistence), and multi-parent merge (multi-team scenarios), depending on which user pain hits first.

The P0 batch unlocks DreamDB as a credible ML training data source. P1 makes it production-credible. P2 is the "vision" tier.


What today's day-of-work proved

Phases 1-3.4 (all shipped today, 2026-05-15) demonstrated that:

  1. The protocol is well-grounded — every architectural fix had a clean spec amendment path or fit within existing semantics. Chain-aware lineage required just one new field (parents); no other primitives needed to change.

  2. Scope boundaries are the key discipline — the most damaging thing we shipped (inline auto-rebuild) violated the protocol/operator boundary. Deleting it was the right call; the operator-driven CLI + cron pattern is the natural replacement.

  3. Mechanism-first lets apps be thin — the four layers (Protocol → SDK → Operator → App) compose cleanly. The sweep dashboard example app will be ~500 LOC because the SDK does the heavy lifting; without the SDK primitives it'd be ~5000 LOC.

The TODO list above is large but every item maps to a real user pain or a known production gap. Nothing on it is speculative.