A hands-on walkthrough that takes you from git clone to "I just ran semantic search over my own data, then deleted a record, then time-traveled back to before the deletion." Runnable end-to-end on a single machine.
We'll build a tiny dataset of 200 images + CLIP embeddings + labels, then:
snapshot the dataset for reproducibility
query it semantically
iterate it as a PyTorch DataLoader
open a prior snapshot (time-travel)
delete a record (tombstone)
view the history visually in a browser
run operator maintenance (ada-ivf-status, gc)
shard ingest across two workers and merge them back
By the end you'll have used every primitive in spec/0006's eight-verb taxonomy and every operator workflow in design/0006-10b-scale-blockers.md.
0. Prerequisites
Rust ≥ 1.83, Python ≥ 3.10, Docker, and uv (or any pip-equivalent).
~10 GB free disk for MinIO + the wheel.
The tutorial uses CLIP-ViT-B/32 for embeddings. CPU works; GPU (MPS on Apple Silicon, CUDA otherwise) makes the CLIP encoding in step 5 ~5× faster.
Once the release build finishes, the dreamdb binary is at target/release/dreamdb.
2. Start MinIO
```bash
docker run -d --name dreamdb-tutorial-minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=dreamdb \
  -e MINIO_ROOT_PASSWORD=dreamdbsecret \
  minio/minio:latest server /data --console-address ":9001"

# Set up the CLI client + create the tutorial bucket (public for simplicity).
docker exec dreamdb-tutorial-minio mc alias set local http://localhost:9000 dreamdb dreamdbsecret
docker exec dreamdb-tutorial-minio mc mb local/tutorial
docker exec dreamdb-tutorial-minio mc anonymous set public local/tutorial
```
Backend URL throughout this tutorial: http://localhost:9000/tutorial.
3. Install the Python wheel
```bash
cd dreamdb-dataset-python
uv pip install maturin
maturin develop --release   # ~2 min; builds the PyO3 wheel
```
After this, import dreamdb_dataset works in any Python ≥3.10 environment that uses this venv.
4. Hello world — 1000 random vectors
The smallest end-to-end DreamDB program. Schema with one embedding field, 1000 random samples, a top-K query.
Run with uv run hello_world.py
```python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///
import numpy as np
import dreamdb_dataset as vd

BACKEND = "http://localhost:9000/tutorial"

# 4a. Define a schema and create the Dataset.
schema = vd.Schema().add_embedding("v", dim=32, algorithm="dreamdb.lsh-cosine")
ds = vd.Dataset.create("hello", schema, backend=BACKEND)

# 4b. Append 1000 random Gaussian vectors (not normalized). Samples are plain dicts.
rng = np.random.default_rng(42)
samples = [{"v": rng.standard_normal(32).astype(np.float32)} for _ in range(1000)]
ds.append_many(samples)

# 4c. Query the top 5 nearest neighbors to a fresh random vector.
q = rng.standard_normal(32).astype(np.float32).tolist()
batches = ds.iter_vector(field="v", query=q, top_k=5)
for batch in batches:
    for anchor in batch["_time_anchors"]:
        print(f"  hit: anchor={anchor}")
```
Anchors are 64-bit time-anchors (the "T" in spec/0003 — DreamDB's only primary key).
5. A real multimodal dataset
Now with images, CLIP embeddings, and labels. We'll use 200 sample images from dreamdb-dataset-python/examples/sample_images/. If that folder is empty in your checkout, download a handful manually or substitute any 200 JPEGs; the tutorial works with any small set.
Run with uv run ingest_clip.py
```python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "torch", "open-clip-torch", "pillow", "dreamdb-dataset"]
# ///
import io, glob, numpy as np, dreamdb_dataset as vd
import open_clip, torch
from PIL import Image

BACKEND = "http://localhost:9000/tutorial"
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")

# 5a. Load CLIP-ViT-B/32. ~600 MB download first time.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval()

# 5b. Schema: image (jpeg) + CLIP embedding (512-dim) + label (categorical).
schema = (vd.Schema()
    .add_image("image", mime="jpeg")
    .add_embedding("embedding", dim=512, algorithm="dreamdb.lsh-cosine")
    .add_scalar_categorical("label"))
ds = vd.Dataset.create("multimodal", schema, backend=BACKEND)

# 5c. Read up to 200 JPEGs from any folder you have. (Adjust path.)
paths = sorted(glob.glob("/Users/me/Pictures/*.jpg"))[:200]
print(f"found {len(paths)} images")
samples = []
clip_imgs = []
for p in paths:
    with open(p, "rb") as f:
        jpeg = f.read()
    img = Image.open(io.BytesIO(jpeg)).convert("RGB")
    clip_imgs.append(preprocess(img))
    samples.append({"image": jpeg, "label": p.split("/")[-1]})

# 5d. CLIP-encode in one batch and attach embeddings.
with torch.no_grad():
    feats = model.encode_image(torch.stack(clip_imgs).to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
feats = feats.cpu().float().numpy()
for s, f in zip(samples, feats):
    s["embedding"] = f

# 5e. Append. One round-trip per modality, regardless of batch size.
ds.append_many(samples)
print(f"appended {len(samples)} samples; manifest hash = {ds.current_manifest()}")
```
6. Snapshot the dataset
Snapshots are immutable named labels pinning a Manifest hash. Use them for "training run X used dataset state Y" reproducibility.
A snapshot is just a Ref pointing at a specific Manifest. No data copy — the 200-sample dataset is now reachable at TWO refs: multimodal (the live working ref) and baseline-200 (frozen). They share storage.
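To make the "no data copy" point concrete, here is a toy model in plain Python. It is not DreamDB's storage code: the `objects` and `refs` dicts, the `put` helper, and the JSON manifest layout are all illustrative assumptions. It only shows why a second ref to the same Manifest hash costs nothing in storage.

```python
import hashlib
import json

objects = {}   # hash -> bytes; content-addressed and write-once (toy store)
refs = {}      # ref name -> manifest hash

def put(data: bytes) -> str:
    """Store bytes under their SHA-256 digest and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    objects.setdefault(digest, data)
    return digest

# A manifest is just another object that lists the hashes of its records.
records = [put(f"sample-{i}".encode()) for i in range(200)]
manifest = put(json.dumps({"records": records}).encode())

refs["multimodal"] = manifest       # live working ref
objects_before = len(objects)
refs["baseline-200"] = manifest     # snapshot: a second name for the same hash

# Taking the snapshot added a ref, not objects.
print(refs["multimodal"] == refs["baseline-200"], len(objects) == objects_before)
```

When the live ref later moves to a new Manifest, only the objects that differ are new; everything else is still shared between the two refs.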
7. Query semantically
```python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["torch", "open-clip-torch", "dreamdb-dataset"]
# ///
import open_clip, torch, dreamdb_dataset as vd

# Encode a text query with CLIP's text tower.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
with torch.no_grad():
    q = model.encode_text(tokenizer(["a sunset"]))
    q = (q / q.norm(dim=-1, keepdim=True))[0].numpy()

ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
batches = ds.iter_vector(field="embedding", query=q.tolist(), top_k=10)
for batch in batches:
    for anchor, label in zip(batch["_time_anchors"], batch["label"]):
        print(f"  hit: anchor={anchor} label={label}")
```
CLIP image and text embeddings share a space, so a text query finds visually-matching images.
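As a rough illustration of what "share a space" buys you, the sketch below ranks a few hand-made 3-dimensional vectors by cosine similarity to a query vector. The vectors and file names are invented stand-ins for real CLIP outputs, and this is exact brute-force ranking, not the dreamdb.lsh-cosine index.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, items, k):
    """Rank (label, vector) pairs by cosine similarity to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), label)
        for label, vec in items
    ]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]

# Made-up "image" embeddings and a made-up "text" embedding for "a sunset".
image_vecs = [
    ("sunset.jpg", [0.9, 0.1, 0.0]),
    ("cat.jpg", [0.0, 1.0, 0.1]),
    ("beach.jpg", [0.7, 0.0, 0.7]),
]
text_vec = [1.0, 0.0, 0.1]

hits = top_k(text_vec, image_vecs, k=2)
print(hits)
```

Because both towers emit vectors in the same space, the text vector can be compared directly against image vectors with no translation step.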
8. Iterate as a PyTorch DataLoader
```python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["torch", "pyarrow", "numpy", "dreamdb-dataset"]
# ///
import dreamdb_dataset as vd
from dreamdb_dataset.torch import DreamDBIterableDataset
from torch.utils.data import DataLoader

ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
torch_ds = DreamDBIterableDataset(ds, batch_size=32, fields=["embedding", "label"])
loader = DataLoader(torch_ds, batch_size=None, num_workers=0)
for batch in loader:
    # `batch` is a pyarrow.RecordBatch with one column per requested field
    # plus `_anchor`. Convert as needed for your training loop.
    embeddings = batch.column("embedding").to_numpy(zero_copy_only=False)
    print(f"batch shape: {embeddings.shape}")
    break
```
DreamDBIterableDataset is a torch.utils.data.IterableDataset over the dataset's Arrow batches. Multi-worker support partitions batches by index across workers.
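One plausible reading of "partitions batches by index" is round-robin assignment, sketched below in plain Python. The helper name `batches_for_worker` and the modulo rule are assumptions for illustration; the actual split lives in dreamdb_dataset.torch and may differ.

```python
def batches_for_worker(num_batches, worker_id, num_workers):
    """Return the batch indices one worker reads (round-robin by index)."""
    return [i for i in range(num_batches) if i % num_workers == worker_id]

# 200 samples at batch_size=32 -> 7 batches (the last one is partial).
num_batches = -(-200 // 32)   # ceiling division
assignment = {w: batches_for_worker(num_batches, w, 3) for w in range(3)}
print(assignment)
```

Whatever the exact rule, the property that matters is that every batch index goes to exactly one worker, so no sample is seen twice per epoch.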
9. Time-travel — open a prior snapshot
```python
import dreamdb_dataset as vd

# Open the dataset at the "baseline-200" snapshot specifically.
# `v` shows the shape of a version dict; it is not used below.
v = {"label": "baseline-200", "manifest": None, "timeline": None}
# (You can also pass the dict returned by `ds.snapshot(...)` directly.)
old = vd.Dataset.open_ref("baseline-200", backend="http://localhost:9000/tutorial")
print(f"manifest at baseline-200: {old.current_manifest()}")
```
The "live" multimodal ref might have moved on (more appends, ada-ivf rebuilds, deletes), but the baseline-200 snapshot is byte-identical to what was there at snapshot time — every Object it transitively references is content-addressed and immutable.
10. Visualize in a browser
```bash
# Serve the static browser UI (no build step; it's plain JS).
cd dreamdb-dataset-python/examples/web
python -m http.server 8080
```
Open http://localhost:8080/browse.html in Chrome. Point it at the backend URL http://localhost:9000/tutorial and the ref name multimodal. You'll see:
A scrollable grid of the 200 images you appended.
A search box that runs semantic queries against your CLIP embeddings, client-side.
A "⏳ history" button that lists every Manifest in the ref's history; click one to re-render the entire view at that point in time. (This is the "killer feature" — the same dataset, at any past moment, with one click.)
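Conceptually, a history list like this falls out of following parent links from the ref's current Manifest back to the first one. The sketch below assumes a single-parent chain and uses made-up manifest ids (`m0`, `m1`, `m2`); how the real browser UI enumerates history is not shown here.

```python
def history(manifests, tip):
    """Walk parent links from the tip back to the root manifest."""
    out = []
    while tip is not None:
        out.append(tip)
        tip = manifests[tip]["parent"]
    return out

# Hypothetical manifest ids; real ones would be content hashes.
manifests = {
    "m0": {"parent": None, "note": "create dataset"},
    "m1": {"parent": "m0", "note": "append 200 samples"},
    "m2": {"parent": "m1", "note": "append more samples"},
}
timeline = history(manifests, "m2")
print(timeline)
```

Rendering "the view at that point in time" then amounts to opening the dataset at one of the listed Manifests instead of at the tip.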
11. Operator maintenance
Two single-machine ops every DreamDB operator runs.
ada-ivf-status reports per-cell record counts, the imbalance score, and a MONITOR / SPLIT / MERGE recommendation. At 200 records this will be MONITOR (k = √200 ≈ 14 cells, ~14 records each, nothing to do).
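The cell arithmetic can be checked in a few lines. The imbalance formula used here (largest cell divided by the mean cell size) is a hypothetical stand-in chosen for illustration; the score DreamDB actually reports may be defined differently.

```python
import math

n_records = 200
k = round(math.sqrt(n_records))      # sqrt(200) ~ 14.14 -> 14 cells
mean_per_cell = n_records / k        # ~14.3 records per cell

# Hypothetical imbalance score: largest cell divided by the mean cell size.
cell_counts = [15] * 4 + [14] * 10   # one possible even spread of 200 records
imbalance = max(cell_counts) / mean_per_cell
print(k, round(mean_per_cell, 1), round(imbalance, 2))
```

A score close to 1 means the cells are nearly even, which matches the "nothing to do" outcome at this size.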
gc reports how many Objects would be reclaimed. At billion-scale, run it with --shard N --of M across multiple k8s pods.
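A sharded sweep only works if every Object belongs to exactly one shard. The sketch below shows one common way to get that, hashing modulo the shard count. The `in_shard` helper and the modulo rule are assumptions, not a description of how dreamdb assigns shards.

```python
import hashlib

def in_shard(object_hash: str, shard: int, of: int) -> bool:
    """Assign an object to exactly one shard by its hash value."""
    return int(object_hash, 16) % of == shard

# Toy population: 1000 objects, of which refs still reach the first 900.
all_objects = [hashlib.sha256(f"obj-{i}".encode()).hexdigest() for i in range(1000)]
reachable = set(all_objects[:900])

of = 4
reclaimable = [
    [h for h in all_objects if in_shard(h, s, of) and h not in reachable]
    for s in range(of)
]
print([len(r) for r in reclaimable], sum(len(r) for r in reclaimable))
```

Each pod handles one value of the shard index, and the per-shard counts add up to the total a single-machine run would report.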
12. Delete a record (tombstone)
GDPR-style suppression: read paths skip the named anchor; storage compaction is a separate operator pass (deferred per spec/0020 §6).
```bash
# Pick an anchor from step 7's query output, e.g. 1779090000000000456.
./target/release/dreamdb delete \
  --backend http://localhost:9000/tutorial \
  --ref-name multimodal \
  --reason gdpr \
  1779090000000000456
```
Or from Python:
```python
import dreamdb_dataset as vd

BACKEND = "http://localhost:9000/tutorial"
ds = vd.Dataset.open_ref("multimodal", backend=BACKEND)
new_hash = ds.delete([1779090000000000456], reason="gdpr")
print("tombstone set now:", ds.tombstone_set())
```
Now re-run the semantic query from step 7. The deleted anchor is gone. But re-run it against the baseline-200 snapshot — the record is still there, because that snapshot pins the Manifest from before the delete:
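```python
# Continues the step 7 script: `q` is the CLIP text embedding computed there,
# and BACKEND is the tutorial backend URL, "http://localhost:9000/tutorial".
old = vd.Dataset.open_ref("baseline-200", backend=BACKEND)
batches = old.iter_vector(field="embedding", query=q.tolist(), top_k=10)
for batch in batches:
    print(batch["_time_anchors"])  # the deleted anchor still appears here
```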
This is DreamDB's time-travel property in action: deletion doesn't rewrite history, it adds a Manifest with a dreamdb.tombstones registry entry. Old snapshots that don't reference the tombstone see the original data.
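The same behavior can be modeled in a few lines of plain Python. This is a toy under stated assumptions: manifests are dicts, the `read` and `delete` helpers are invented for illustration, and the tombstone set is stored inline rather than in a dreamdb.tombstones registry entry.

```python
def read(manifest):
    """Read path: return anchors that are not tombstoned in this manifest."""
    return [a for a in manifest["anchors"] if a not in manifest["tombstones"]]

def delete(manifest, anchors, reason):
    """Deletion publishes a new manifest; the old one is left untouched."""
    return {
        "parent": manifest,
        "anchors": manifest["anchors"],   # record data is shared, not rewritten
        "tombstones": manifest["tombstones"] | set(anchors),
        "reason": reason,
    }

baseline = {"parent": None, "anchors": [101, 102, 103], "tombstones": frozenset()}
live = delete(baseline, [102], reason="gdpr")

print(read(live))       # the live ref skips the tombstoned anchor
print(read(baseline))   # the snapshot still sees every anchor
```

The record bytes are never touched by `delete`, which is why reclaiming storage is left to a separate compaction pass.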
13. Sharded ingest with merge-many
For real-scale ingest, run N workers each ingesting into their own branch, then merge them all into trunk in one shot.
Worker 0 -- branch and append its slice:
```python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///
import numpy as np, dreamdb_dataset as vd

ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
w = ds.branch("ingest-w-0")
rng = np.random.default_rng(0)
w.append_many([
    {"image": b"fake-jpeg-0",
     "embedding": rng.standard_normal(512).astype(np.float32),
     "label": f"w0-{i}"}
    for i in range(50)
])
```
Worker 1 -- same, different slice:
```python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///
import numpy as np, dreamdb_dataset as vd

ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
w = ds.branch("ingest-w-1")
rng = np.random.default_rng(1)
w.append_many([
    {"image": b"fake-jpeg-1",
     "embedding": rng.standard_normal(512).astype(np.float32),
     "label": f"w1-{i}"}
    for i in range(50)
])
```
The orchestrator then merges both branches into trunk in one shot, for example via the CLI.
The orchestrator publishes a new Manifest with parents = [trunk_tip, w0_tip, w1_tip] (chained sequentially in v0). At 10B scale you'd run 64 workers in parallel against a k8s Job; the merge-many step takes ~minutes regardless. See design/0007-sharded-ingest.md for the algorithm and the k8s YAML pattern.
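One way to picture "chained sequentially" is a left fold: merge the first branch into trunk, then merge the next branch into that result, and so on. The sketch below models manifests as dicts with record sets; the `merge` and `merge_many` helpers are illustrative assumptions, and design/0007-sharded-ingest.md remains the authority on the real algorithm.

```python
from functools import reduce

def merge(base, branch):
    """Two-way merge: a new manifest whose records are the union of both sides."""
    return {
        "name": f"{base['name']}+{branch['name']}",
        "parents": [base["name"], branch["name"]],
        "records": base["records"] | branch["records"],
    }

def merge_many(trunk, branches):
    """Fold the branches into trunk one at a time (sequential chaining)."""
    return reduce(merge, branches, trunk)

# Trunk holds the 200 tutorial samples; each worker branch adds 50 of its own.
trunk = {"name": "trunk", "parents": [], "records": frozenset(range(200))}
w0 = {"name": "ingest-w-0", "parents": ["trunk"],
      "records": trunk["records"] | frozenset(range(200, 250))}
w1 = {"name": "ingest-w-1", "parents": ["trunk"],
      "records": trunk["records"] | frozenset(range(250, 300))}

merged = merge_many(trunk, [w0, w1])
print(len(merged["records"]), merged["name"])
```

Because the workers append disjoint slices, the fold never has to resolve a conflict; it only unions record sets.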
14. Moving to S3 (optional)
When local MinIO outgrows your laptop disk, the path to production is just env vars — no code changes:
```bash
export AWS_ACCESS_KEY_ID="AKIA…"
export AWS_SECRET_ACCESS_KEY="…"
export AWS_REGION="us-east-1"

# Same Python / CLI commands as above; just point at the S3 bucket.
python -c 'import dreamdb_dataset as vd; print(vd.Dataset.open_ref("multimodal", backend="https://s3.us-east-1.amazonaws.com/my-bucket").count())'
```
The connector auto-detects the env vars and signs every request with SigV4. Works against AWS S3, Cloudflare R2, Backblaze B2, Wasabi, and any other S3-compatible endpoint. For R2, set AWS_REGION=auto and use your R2 endpoint URL.
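For readers curious why the region matters, this is the standard SigV4 signing-key derivation, written with only the Python standard library. It is a sketch of the published AWS scheme, not DreamDB's connector code, and the secret, date, and function names below are placeholders.

```python
import hashlib
import hmac

def _hmac(key: bytes, msg: str) -> bytes:
    """HMAC-SHA256 of a UTF-8 message under the given key."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key for one day, region, and service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# Placeholder credentials; date is YYYYMMDD in UTC.
key_aws = sigv4_signing_key("example-secret", "20260518", "us-east-1", "s3")
key_r2 = sigv4_signing_key("example-secret", "20260518", "auto", "s3")
print(len(key_aws), key_aws != key_r2)
```

The region string is mixed into the key itself, so a request signed for us-east-1 will not verify against an endpoint that expects auto, and vice versa.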
15. What now?
You've used the full DreamDB surface. Here's where to read next:
For the protocol: spec/0000-overview.md is the entry point; INDEX.md is the navigation index for all 21 specs.
For real ingest at scale: design/0006-10b-scale-blockers.md (everything that makes DreamDB 10B-ready).
For sharded ingest mechanics: design/0007-sharded-ingest.md.
For the Rust SDK: start at dreamdb-dataset/src/dataset.rs; the public surface is small and well-documented.
For operator workflows: dreamdb --help lists every CLI verb.
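To clean up, remove the MinIO container and its data volume (docker rm -f -v dreamdb-tutorial-minio). After that, the tutorial bucket and all 200 samples + 1000 random vectors are gone. DreamDB itself is just files on disk; nothing else persists.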
Status notes
This tutorial reflects the v0 reference implementation as of 2026-05-18. Two things to know:
iter_arrow_batches materializes the whole dataset into memory before yielding. For 10B-scale Python streaming, use Dataset.iter_stream(batch_size, fields) instead — it returns a true generator (StreamBatchIter) backed by a Rust mpsc channel, with bounded per-batch RAM. Embedding-only in v0; multi-modality merge-join is the next streaming-iter slice.
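The bounded-RAM idea behind a channel-backed generator can be sketched in pure Python with a producer thread and a fixed-size queue. This is an analogy only: `stream_batches` and `max_in_flight` are invented names, and the real StreamBatchIter is implemented in Rust.

```python
import queue
import threading

def stream_batches(source, batch_size, max_in_flight=2):
    """Yield batches from a producer thread through a bounded queue."""
    chan = queue.Queue(maxsize=max_in_flight)
    done = object()   # sentinel marking the end of the stream

    def producer():
        batch = []
        for item in source:
            batch.append(item)
            if len(batch) == batch_size:
                chan.put(batch)   # blocks while the consumer is behind
                batch = []
        if batch:
            chan.put(batch)       # final partial batch
        chan.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := chan.get()) is not done:
        yield batch

sizes = [len(b) for b in stream_batches(range(10), batch_size=4)]
print(sizes)
```

At most `max_in_flight` batches sit in the queue at any time, so memory use depends on the batch size rather than on the dataset size.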
CLIP-ViT-B/32 on M-series MPS runs at ~200-300 samples/sec — that's the GPU forward-pass limit, not a DreamDB bottleneck. Production ingest uses sharded workers (step 13).