DreamDBv0.2.0bec026

DreamDB for ML Training — Tutorial

Companion to design/0003-scope-boundaries.md (the architecture) and design/0004-todo-roadmap.md (what's still missing).

This tutorial walks through the canonical ML training workflow DreamDB was designed for: pull dataset → snapshot → train → branch per sweep → compare → tag the winner. Code is real, drawn from dreamdb-dataset-python/examples/.

DreamDB's pitch in one sentence: git for ML datasets. Snapshots are tags. Branches are branches. Refs are immutable by-name pointers at Manifest hashes. Everything is content-addressed, so storage costs dedup across runs.

Prerequisites

bash
# 1. MinIO (or any S3-compatible backend).
docker run -d --name dreamdb-minio -p 9000:9000 minio/minio server /data

# 2. Install dreamdb_dataset Python wheel.
VIRTUAL_ENV=./venv maturin develop --release \
    --manifest-path dreamdb-dataset-python/Cargo.toml

# 3. Python deps for ML.
pip install torch pyarrow numpy open_clip_torch

Backend URL convention: http://localhost:9000/<bucket-name>. Datasets within a bucket are addressed by name (the ref_name).

1. Ingest a dataset

For a starting corpus, use the ingest_imagenet100_clip.py example. It ingests ImageNet-100's parquet shards, computes CLIP embeddings, and stores them with IVF partitioning + RaBitQ compression:

bash
python examples/ingest_imagenet100_clip.py \
    --root /path/to/imagenet-100 \
    --backend http://localhost:9000/imagenet-100 \
    --dataset-name imagenet-100 \
    --splits train,validation \
    --ivf --rabitq

After this, the DreamDB Space has:

  • image field (JPEG bytes in Fragment Tracks)
  • embedding field (RaBitQ-compressed CLIP vectors)
  • label, split scalar fields (categorical strings)

2. Open + snapshot

Pinning training to a snapshot is the foundation of reproducibility. Snapshots are 33-byte content pointers — free to create, free to keep.

python
import dreamdb_dataset as vd

ds = vd.Dataset.open_ref(
    "imagenet-100",
    backend="http://localhost:9000/imagenet-100",
)
print(f"current_manifest: {ds.current_manifest()}")

# Pin to an immutable state.
snap = ds.snapshot(f"model-v2-prep-{int(time.time())}")
# snap is {"label": "...", "manifest": "...", "timeline": "..."}

The snapshot Ref never moves. Future appends to imagenet-100 don't affect it.

3. Stream as Arrow batches

Dataset.iter_arrow_batches yields pyarrow.RecordBatch with one column per field plus _anchor:

python
for batch in ds.iter_arrow_batches(
    batch_size=256,
    fields=["embedding", "label"],
    shuffle_seed=42,  # deterministic in-batch shuffle
):
    # Embeddings come as FixedSizeList<float32>[dim]
    embs = batch.column("embedding").values.to_numpy(zero_copy_only=False)
    embs = embs.reshape(batch.num_rows, 512)
    labels = batch.column("label").to_pylist()
    # ... training step

What's emitted today (P4.0):

  • image / videoBinary column
  • embeddingFixedSizeList<float32>[dim] (decodes RaBitQ via the schema's VectorCompressor)
  • scalar fields → typed columns
  • _anchoruint64

Perf note: the underlying iter is eager — all buckets are fetched before any batches are returned. For 100K+ records expect a few minutes of initial load. Streaming iter is P1 (design/0004).

4. PyTorch DataLoader integration

dreamdb_dataset.torch.DreamDBDataset is an IterableDataset wrapping iter_arrow_batches:

python
from dreamdb_dataset.torch import DreamDBDataset
import torch.utils.data

pinned = vd.Dataset.open_at(snap, backend="...")
train_ds = DreamDBDataset(
    dataset=pinned,
    fields=["embedding", "label"],
    batch_size=128,
    shuffle_seed=42,
)
loader = torch.utils.data.DataLoader(
    train_ds,
    batch_size=None,    # DreamDBDataset already produces batches
    num_workers=4,      # workers partition by batch_idx % num_workers
)

for batch in loader:
    embs = torch.from_numpy(batch["embedding"]).cuda()
    labels = torch.tensor(label_to_idx[batch["label"]]).cuda()
    # ... loss.backward() ...

Multi-worker: each worker keeps every Nth batch. Cheap, deterministic, no overlap.

5. The sweep pattern

The canonical "branch per config" workflow:

python
SWEEP_CONFIGS = [
    {"lr": 0.001, "batch_size": 128},
    {"lr": 0.01,  "batch_size": 128},
    {"lr": 0.1,   "batch_size": 128},
    {"lr": 0.001, "batch_size": 512},
    {"lr": 0.01,  "batch_size": 512},
    {"lr": 0.1,   "batch_size": 512},
]

src = vd.Dataset.open_ref("imagenet-100", backend="...")
snap = src.snapshot("sweep-baseline")

run_refs = []
for i, cfg in enumerate(SWEEP_CONFIGS):
    ref_name = f"sweep/lp-lr{cfg['lr']}-bs{cfg['batch_size']}"
    branch = src.branch(ref_name)

    # ... train a model with cfg, get per-record predictions ...
    predictions = train(snap, cfg)  # [(anchor, pred_label), ...]

    # Write predictions as a scalar layer on the branch.
    branch._inner.add_scalar_layer(
        "prediction", "embedding", "categorical", predictions
    )
    run_refs.append(ref_name)

Each branch is a 33-byte Ref. Content-addressing means image bytes are stored ONCE; each run only adds its prediction Track. A 100-run sweep on imagenet-100 adds ~50 MB total on top of the ~5 GB image corpus.

6. Compare runs side-by-side

compare_refs joins all sweep outputs by anchor into one wide pyarrow.Table:

python
table = vd.compare_refs(
    refs=run_refs,
    fields=["prediction"],
    backend="...",
)
# columns: _anchor, prediction@sweep/lp-lr0.001-bs128, prediction@..., ...
# rows: one per record across all runs

# Analyze in pandas / polars:
import polars as pl
df = pl.from_arrow(table)
df.select(pl.col("^prediction@.*$").n_unique()).head()

The CLI equivalent (dreamdb-cli compare-refs) prints a pairwise agreement matrix to stdout. Useful for quick sweep summaries.

7. Tag the winner

After picking the best run, label the snapshot with the model's identity. Future audits resolve it back:

python
best_run = max(summaries, key=lambda s: s["val_acc"])
src.snapshot(f"prod-2026-05-15-{best_run['model_hash']}")

# Months later:
# $ dreamdb-cli inspect --ref-name prod-2026-05-15-<hash>
# walks the Manifest DAG; shows the exact training set state.

This is the audit trail that lets you answer "what data trained this model?" without a separate metadata service.

Patterns

Active learning loop

python
ds = vd.Dataset.open_ref("corpus", backend="...")
for iteration in range(N):
    snap = ds.snapshot(f"iter-{iteration}")
    model = train(snap)
    hard = model.find_hard_examples(production_stream)
    labels = label_with_humans(hard)
    ds.append_many(labels)  # corpus grows; snapshots stay pinned

Linear-probe + frozen backbone

python
# Reads CLIP embeddings already in the dataset; trains a probe.
for batch in pinned.iter_arrow_batches(fields=["embedding", "label"]):
    embs = batch.column("embedding").values.to_numpy().reshape(-1, dim)
    y = label_to_idx_vectorized(batch.column("label"))
    logits = W @ embs.T
    loss = cross_entropy(logits, y)
    # ...

Distillation

python
# Source's embeddings are the teacher; train a student model.
# Student outputs are written to a branch as a new embedding layer.
branch.add_embedding_layer(
    name="embedding_student",
    parent_field="image",
    dim=128,                            # student is smaller
    algorithm="dreamdb.ivf-cosine",
    spatial_index=student_si_hash,      # operator-trained
    compressor=student_vc_hash,         # operator-published
    rerank=False,
    samples=student_outputs,            # [(anchor, vector), ...]
)

What's coming (P1)

The P4 first-real-run surfaced these priorities (in order of pain):

  1. Streaming iterDataset.iter_stream() returning AsyncIterator instead of materializing everything. Critical at 1B-scale.
  2. More parallel fetch — Fragment + Scalar fetches still serial in iter_time_range (embedding fetches already parallelized in P4.0).
  3. Embedding-only iter mode — skip image fetch when training only needs embeddings (big speedup on image-heavy corpora).
  4. Schema migration verb — opt existing datasets into new features (e.g. enable rerank=True post-hoc) without re-ingesting.

Track at design/0004-todo-roadmap.md.

Recap

DreamDB provides:

  • Immutability + time-travel: every state is a Manifest; refs are by-name pointers
  • Multimodal joins: image + embedding + label come from one iter call
  • Branching for free: 33-byte PUT to create a new Ref
  • Storage dedup: content-addressed objects shared across refs
  • PyTorch / Arrow native: idiomatic for ML pipelines

What DreamDB doesn't try to be:

  • A training framework (use PyTorch / JAX / etc.)
  • A column-scan database (use LanceDB / Parquet for static datasets)
  • An experiment tracker (use W&B / MLflow on top)

It's the substrate that the other tools sit on. Pick DreamDB when your dataset is alive: continuously appended, multi-source, multi-version, multimodal.