DreamDB for ML Training — Tutorial
Companion to design/0003-scope-boundaries.md (the architecture) and
design/0004-todo-roadmap.md (what's still missing).
This tutorial walks through the canonical ML training workflow DreamDB
was designed for: pull dataset → snapshot → train → branch per sweep →
compare → tag the winner. Code is real, drawn from
dreamdb-dataset-python/examples/.
DreamDB's pitch in one sentence: git for ML datasets. Snapshots are tags. Branches are branches. Refs are immutable by-name pointers to Manifest hashes. Everything is content-addressed, so storage is deduplicated across runs.
Prerequisites
Backend URL convention: http://localhost:9000/<bucket-name>. Datasets
within a bucket are addressed by name (the ref_name).
1. Ingest a dataset
For a starting corpus, use the ingest_imagenet100_clip.py example. It
ingests ImageNet-100's parquet shards, computes CLIP embeddings, and
stores them with IVF partitioning + RaBitQ compression:
After this, the DreamDB Space has:
- image field (JPEG bytes in Fragment Tracks)
- embedding field (RaBitQ-compressed CLIP vectors)
- label, split scalar fields (categorical strings)
2. Open + snapshot
Pinning training to a snapshot is the foundation of reproducibility. Snapshots are 33-byte content pointers, so they cost almost nothing to create or to keep.
The snapshot Ref never moves. Future appends to imagenet-100 don't
affect it.
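The pinning behaviour can be pictured with a toy ref table. This is a minimal sketch of the idea, not DreamDB's API: RefTable, append, snapshot, and read are hypothetical names, the snapshot name imagenet-100@train-v1 is made up, and a SHA-256 over the record list stands in for a Manifest hash.

```python
import hashlib
import json


class RefTable:
    """Toy model of named refs pointing at immutable manifests.

    Hypothetical illustration only; this is not DreamDB's API.
    """

    def __init__(self):
        self.manifests = {}  # manifest hash -> tuple of record ids
        self.refs = {}       # ref name -> manifest hash

    def _store(self, records):
        # Hash the content itself, so identical states share one manifest.
        digest = hashlib.sha256(json.dumps(records).encode()).hexdigest()
        self.manifests[digest] = tuple(records)
        return digest

    def append(self, ref_name, new_records):
        # Appending writes a new manifest and moves the named ref to it.
        current = list(self.manifests.get(self.refs.get(ref_name), ()))
        self.refs[ref_name] = self._store(current + list(new_records))

    def snapshot(self, ref_name, snapshot_name):
        # A snapshot records the current hash under a new name.
        self.refs[snapshot_name] = self.refs[ref_name]

    def read(self, ref_name):
        return self.manifests[self.refs[ref_name]]


table = RefTable()
table.append("imagenet-100", ["img-0", "img-1"])
table.snapshot("imagenet-100", "imagenet-100@train-v1")
table.append("imagenet-100", ["img-2"])  # the live dataset keeps growing

pinned = table.read("imagenet-100@train-v1")  # still the two original records
live = table.read("imagenet-100")             # sees the appended record
```

The append creates a new manifest rather than editing the old one, which is why the snapshot keeps resolving to the same records.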
3. Stream as Arrow batches
Dataset.iter_arrow_batches yields pyarrow.RecordBatch with one
column per field plus _anchor:
What's emitted today (P4.0):
- image / video → Binary column
- embedding → FixedSizeList<float32>[dim] (decodes RaBitQ via the schema's VectorCompressor)
- scalar fields → typed columns
- _anchor → uint64
Perf note: the underlying iter is eager — all buckets are fetched
before any batches are returned. For 100K+ records expect a few minutes
of initial load. Streaming iter is P1 (design/0004).
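The difference between the eager behaviour and a streaming iterator can be shown with plain generators. This is a sketch under stated assumptions: fetch_bucket, iter_eager, and iter_streaming are hypothetical stand-ins, not DreamDB functions, and a list append stands in for a network fetch.

```python
def fetch_bucket(bucket_id, log):
    # Stand-in for a remote fetch; records the call so it can be counted.
    log.append(f"fetch {bucket_id}")
    return [f"record-{bucket_id}-{i}" for i in range(2)]


def iter_eager(bucket_ids, log):
    # Fetch every bucket first, then hand out batches.
    buckets = [fetch_bucket(b, log) for b in bucket_ids]
    for bucket in buckets:
        yield bucket


def iter_streaming(bucket_ids, log):
    # Fetch one bucket at a time, only when the consumer asks for it.
    for b in bucket_ids:
        yield fetch_bucket(b, log)


eager_log, stream_log = [], []
first_eager = next(iter_eager(range(3), eager_log))
first_stream = next(iter_streaming(range(3), stream_log))
```

Both iterators return the same first batch, but the eager one has already issued all three fetches by then, while the streaming one has issued one. That gap is the initial load time the perf note describes.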
4. PyTorch DataLoader integration
dreamdb_dataset.torch.DreamDBDataset is an IterableDataset wrapping
iter_arrow_batches:
Multi-worker: each worker keeps every Nth batch. Cheap, deterministic, no overlap.
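The "every Nth batch" rule can be sketched with itertools.islice. This is an illustration of the sharding scheme only, not the DreamDBDataset source: shard_batches is a hypothetical helper, and integers stand in for batches. In a real PyTorch IterableDataset the worker id and worker count come from torch.utils.data.get_worker_info().

```python
from itertools import islice


def shard_batches(batches, worker_id, num_workers):
    """Keep the batches at positions worker_id, worker_id + N, worker_id + 2N, ..."""
    return islice(batches, worker_id, None, num_workers)


num_workers = 3
shards = [
    list(shard_batches(iter(range(10)), worker_id, num_workers))
    for worker_id in range(num_workers)
]
```

Every batch lands in exactly one shard, and the assignment depends only on batch position, so it is the same on every run. In this sketch each worker still walks the whole batch sequence and discards what it does not keep.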
5. The sweep pattern
The canonical "branch per config" workflow:
Each branch is a 33-byte Ref. Content-addressing means image bytes are
stored ONCE; each run only adds its prediction Track. A 100-run
sweep on imagenet-100 adds ~50 MB total on top of the ~5 GB image
corpus.
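With the figures above, each run adds about 0.5 MB on average (50 MB over 100 runs), roughly 1% on top of the 5 GB corpus when 1 GB is taken as 1000 MB. One hundred full copies would take about 500 GB. The mechanism can be sketched with a toy content-addressed store. BlobStore is a hypothetical stand-in, not DreamDB's storage layer, and the byte sizes are arbitrary.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # identical bytes are stored once
        return key

    def total_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = BlobStore()
images = [bytes([i]) * 1000 for i in range(5)]  # five 1000-byte stand-in images

runs = {}
for run in range(3):
    image_keys = [store.put(img) for img in images]      # same bytes, same keys
    prediction_key = store.put(b"pred:" + bytes([run]) * 10)  # 15 bytes per run
    runs[f"run-{run}"] = image_keys + [prediction_key]

total = store.total_bytes()              # 5 images + 3 prediction blobs
naive_total = 3 * (5 * 1000 + 15)        # what three full copies would cost
```

The store ends up holding eight blobs: the five images once, plus one prediction blob per run.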
6. Compare runs side-by-side
compare_refs joins all sweep outputs by anchor into one wide
pyarrow.Table:
The CLI equivalent (dreamdb-cli compare-refs) prints a pairwise
agreement matrix to stdout. Useful for quick sweep summaries.
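The join and the agreement matrix can be sketched in plain Python. This is a sketch, not the compare_refs implementation: join_by_anchor and agreement_matrix are hypothetical helpers, the run names and labels are made up, and keeping only the anchors present in every run (an inner join) is an assumption.

```python
def join_by_anchor(runs):
    """Turn {run: {anchor: prediction}} into {anchor: {run: prediction}}.

    Assumption: only anchors present in every run are kept (inner join).
    """
    shared = set.intersection(*(set(preds) for preds in runs.values()))
    return {
        anchor: {name: preds[anchor] for name, preds in runs.items()}
        for anchor in sorted(shared)
    }


def agreement_matrix(wide, names):
    """Fraction of anchors on which each pair of runs predicts the same label."""
    if not wide:
        raise ValueError("no shared anchors to compare")
    return {
        (a, b): sum(row[a] == row[b] for row in wide.values()) / len(wide)
        for a in names
        for b in names
    }


runs = {
    "lr-0.1": {1: "cat", 2: "dog", 3: "cat", 4: "fox"},
    "lr-0.01": {1: "cat", 2: "dog", 3: "dog", 4: "fox"},
    "lr-0.001": {1: "cat", 2: "cat", 3: "dog", 4: "fox"},
}
wide = join_by_anchor(runs)
matrix = agreement_matrix(wide, list(runs))
```

Here matrix[("lr-0.1", "lr-0.01")] is 0.75 because the two runs agree on three of the four shared anchors.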
7. Tag the winner
After picking the best run, label the snapshot with the model's identity. Future audits resolve it back:
This is the audit trail that lets you answer "what data trained this model?" without a separate metadata service.
Patterns
Active learning loop
Linear-probe + frozen backbone
Distillation
What's coming (P1)
The P4 first-real-run surfaced these priorities (in order of pain):
- Streaming iter: Dataset.iter_stream() returning AsyncIterator instead of materializing everything. Critical at 1B-scale.
- More parallel fetch: Fragment + Scalar fetches still serial in iter_time_range (embedding fetches already parallelized in P4.0).
- Embedding-only iter mode: skip image fetch when training only needs embeddings (big speedup on image-heavy corpora).
- Schema migration verb: opt existing datasets into new features (e.g. enable rerank=True post-hoc) without re-ingesting.
Track at design/0004-todo-roadmap.md.
Recap
DreamDB provides:
- Immutability + time-travel: every state is a Manifest; refs are by-name pointers
- Multimodal joins: image + embedding + label come from one iter call
- Branching for free: 33-byte PUT to create a new Ref
- Storage dedup: content-addressed objects shared across refs
- PyTorch / Arrow native: idiomatic for ML pipelines
What DreamDB doesn't try to be:
- A training framework (use PyTorch / JAX / etc.)
- A column-scan database (use LanceDB / Parquet for static datasets)
- An experiment tracker (use W&B / MLflow on top)
It's the substrate that the other tools sit on. Pick DreamDB when your dataset is alive: continuously appended, multi-source, multi-version, multimodal.