DreamDB v0.2.0 (bec026)

DreamDB in 10 minutes

A hands-on walkthrough that takes you from git clone to "I just ran semantic search over my own data, then deleted a record, then time-traveled back to before the deletion." Runnable end-to-end on a single machine.

We'll build a tiny dataset of 200 images + CLIP embeddings + labels, then:

  • snapshot the dataset for reproducibility
  • query it semantically
  • iterate it as a PyTorch DataLoader
  • open a prior snapshot (time-travel)
  • view the history visually in a browser
  • run operator maintenance (ada-ivf-status, gc)
  • delete a record (tombstone)
  • shard ingest across two workers and merge them back

By the end you'll have used every primitive in spec/0006's eight-verb taxonomy and every operator workflow in design/0006-10b-scale-blockers.md.

0. Prerequisites

  • Rust ≥ 1.83, Python ≥ 3.10, Docker, and uv (or any pip-equivalent).
  • ~10 GB free disk for MinIO + the wheel.

The tutorial uses CLIP-ViT-B/32 for embeddings. CPU works; a GPU (MPS on Apple Silicon, CUDA otherwise) makes the embedding pass in step 5 roughly 5× faster.

1. Clone and build

bash
git clone https://github.com/<your-org>/dreamdb
cd dreamdb
cargo build --release          # ~5 min cold; <30s incremental

The target/release/dreamdb binary will exist when this finishes.

2. Start MinIO

bash
docker run -d --name dreamdb-tutorial-minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=dreamdb \
  -e MINIO_ROOT_PASSWORD=dreamdbsecret \
  minio/minio:latest server /data --console-address ":9001"

# Set up the CLI client + create the tutorial bucket (public for simplicity).
docker exec dreamdb-tutorial-minio mc alias set local http://localhost:9000 dreamdb dreamdbsecret
docker exec dreamdb-tutorial-minio mc mb local/tutorial
docker exec dreamdb-tutorial-minio mc anonymous set public local/tutorial

Backend URL throughout this tutorial: http://localhost:9000/tutorial.
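
Before moving on, it can help to confirm that MinIO is actually accepting requests. The following is a minimal sketch, assuming MinIO's standard liveness endpoint `/minio/health/live`; the helper names (`health_url`, `http_probe`, `wait_until_ready`) are illustrative and not part of DreamDB.

```python
import time
import urllib.error
import urllib.request


def health_url(endpoint):
    """Build the MinIO liveness URL for a server endpoint (no bucket path)."""
    return endpoint.rstrip("/") + "/minio/health/live"


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(endpoint, probe=http_probe, attempts=10, delay=0.5):
    """Poll the liveness URL until it responds or the attempts run out."""
    url = health_url(endpoint)
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready("http://localhost:9000")` right after `docker run` should return `True` within a few seconds on a healthy container. The `probe` parameter exists only so the polling loop can be exercised without a network.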

3. Install the Python wheel

bash
cd dreamdb-dataset-python
uv pip install maturin
maturin develop --release      # ~2 min; builds the PyO3 wheel

After this, import dreamdb_dataset works in any Python ≥3.10 environment that uses this venv.
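
A quick way to check that the wheel landed in the interpreter you are about to use is to ask Python whether the module can be found. This is a small sketch using only the standard library; the function name `is_importable` is illustrative.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "dreamdb_dataset"
    status = "found" if is_importable(name) else "NOT found in this environment"
    print(f"{name}: {status}")
```

If this reports "NOT found", the usual cause is that `maturin develop` ran in a different virtual environment from the one now active.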

4. Hello world — 1000 random vectors

The smallest end-to-end DreamDB program. Schema with one embedding field, 1000 random samples, a top-K query.

Run with uv run hello_world.py

python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///

import numpy as np
import dreamdb_dataset as vd

BACKEND = "http://localhost:9000/tutorial"

# 4a. Define a schema and create the Dataset.
schema = vd.Schema().add_embedding("v", dim=32, algorithm="dreamdb.lsh-cosine")
ds = vd.Dataset.create("hello", schema, backend=BACKEND)

# 4b. Append 1000 random Gaussian vectors (not normalized; cosine similarity ignores magnitude). Samples are plain dicts.
rng = np.random.default_rng(42)
samples = [{"v": rng.standard_normal(32).astype(np.float32)} for _ in range(1000)]
ds.append_many(samples)

# 4c. Query the top 5 nearest neighbors to a fresh random vector.
q = rng.standard_normal(32).astype(np.float32).tolist()
batches = ds.iter_vector(field="v", query=q, top_k=5)
for batch in batches:
    for anchor in batch["_time_anchors"]:
        print(f"  hit: anchor={anchor}")

Output looks like:

  hit: anchor=1779090000000000001
  hit: anchor=1779090000000000123
  hit: anchor=1779090000000000456
  hit: anchor=1779090000000000789
  hit: anchor=1779090000000000999

Anchors are 64-bit time-anchors (the "T" in spec/0003 — DreamDB's only primary key).
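
The sample anchors above have the magnitude of nanoseconds since the Unix epoch. The sketch below decodes one under that reading. This is an assumption inferred from the example values, not something taken from spec/0003, and `split_anchor` is an illustrative helper rather than a DreamDB API.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def split_anchor(anchor):
    """Split an anchor into (UTC datetime of the whole second, leftover nanoseconds).

    ASSUMPTION: the anchor counts nanoseconds since 1970-01-01T00:00:00Z.
    """
    if not 0 <= anchor < 2**64:
        raise ValueError("anchor does not fit in 64 unsigned bits")
    seconds, nanos = divmod(anchor, NANOS_PER_SECOND)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos


when, nanos = split_anchor(1779090000000000456)
print(when.isoformat(), nanos)
```

Under that assumption the third sample anchor decodes to 2026-05-18T07:40:00 UTC plus 456 ns, which would make anchors sortable by creation time.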

5. A real multimodal dataset

Now with images, CLIP embeddings, and labels. We'll use up to 200 JPEGs. If your checkout has sample images under dreamdb-dataset-python/examples/sample_images/, point the script there; otherwise substitute any folder of JPEGs (the tutorial works with any small set).

Run with uv run ingest_clip.py

python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "torch", "open-clip-torch", "pillow", "dreamdb-dataset"]
# ///

import io, glob, numpy as np, dreamdb_dataset as vd
import open_clip, torch
from PIL import Image

BACKEND = "http://localhost:9000/tutorial"
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")

# 5a. Load CLIP-ViT-B/32. ~600 MB download first time.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval()

# 5b. Schema: image (jpeg) + CLIP embedding (512-dim) + label (categorical).
schema = (vd.Schema()
          .add_image("image", mime="jpeg")
          .add_embedding("embedding", dim=512, algorithm="dreamdb.lsh-cosine")
          .add_scalar_categorical("label"))
ds = vd.Dataset.create("multimodal", schema, backend=BACKEND)

# 5c. Read up to 200 JPEGs from any folder you have. (Adjust path.)
paths = sorted(glob.glob("/Users/me/Pictures/*.jpg"))[:200]
print(f"found {len(paths)} images")

samples = []
clip_imgs = []
for p in paths:
    with open(p, "rb") as f:
        jpeg = f.read()
    img = Image.open(io.BytesIO(jpeg)).convert("RGB")
    clip_imgs.append(preprocess(img))
    samples.append({"image": jpeg, "label": p.split("/")[-1]})

# 5d. CLIP-encode in one batch and attach embeddings.
with torch.no_grad():
    feats = model.encode_image(torch.stack(clip_imgs).to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
feats = feats.cpu().float().numpy()
for s, f in zip(samples, feats):
    s["embedding"] = f

# 5e. Append. One round-trip per modality, regardless of batch size.
ds.append_many(samples)
print(f"appended {len(samples)} samples; manifest hash = {ds.current_manifest()}")
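
Step 5d stacks every image into one tensor, which is fine for 200 images but can exhaust GPU memory on larger folders. A minimal sketch of a chunking helper follows; `chunked` is an illustrative name, and the chunk size is something you would tune for your device.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Example: 5 items in chunks of 2 leaves a short final chunk.
print(list(chunked(range(5), 2)))
```

In the ingest script you would loop `for chunk in chunked(clip_imgs, 64)`, encode `torch.stack(chunk)` per iteration, and concatenate the resulting feature tensors before normalizing.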

6. Snapshot the dataset

Snapshots are immutable named labels pinning a Manifest hash. Use them for "training run X used dataset state Y" reproducibility.

python
import dreamdb_dataset as vd
ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
v = ds.snapshot("baseline-200")
print(f"snapshot label   = {v['label']}")
print(f"snapshot manifest = {v['manifest']}")
print(f"timeline          = {v['timeline']}")

A snapshot is just a Ref pointing at a specific Manifest. No data copy — the 200-sample dataset is now reachable at TWO refs: multimodal (the live working ref) and baseline-200 (frozen). They share storage.
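
To make the "no data copy" point concrete, here is a toy in-memory model of a content-addressed store with named refs. It is a sketch of the general idea only: `ToyStore`, its manifest layout, and the use of SHA-256 are assumptions for illustration and say nothing about DreamDB's on-disk format.

```python
import hashlib
import json


class ToyStore:
    """In-memory stand-in for a content-addressed object store plus named refs."""

    def __init__(self):
        self.objects = {}  # hash -> bytes (never overwritten once stored)
        self.refs = {}     # ref name -> manifest hash

    def put(self, payload):
        """Store bytes under their SHA-256 digest and return the digest."""
        digest = hashlib.sha256(payload).hexdigest()
        self.objects.setdefault(digest, payload)
        return digest

    def commit(self, ref, record_hashes):
        """Write a manifest listing record hashes and point `ref` at it."""
        manifest = json.dumps({"records": sorted(record_hashes)}).encode()
        self.refs[ref] = self.put(manifest)
        return self.refs[ref]

    def snapshot(self, ref, label):
        """A snapshot copies one pointer, never the data."""
        self.refs[label] = self.refs[ref]
        return self.refs[label]


store = ToyStore()
records = [store.put(f"sample-{i}".encode()) for i in range(3)]
store.commit("multimodal", records)
before = len(store.objects)
store.snapshot("multimodal", "baseline-200")
print(store.refs["multimodal"] == store.refs["baseline-200"], len(store.objects) == before)
```

Both refs resolve to the same manifest hash and the object count is unchanged. A later `commit` to `multimodal` moves only that ref; `baseline-200` keeps pointing at the old manifest.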

7. Query semantically

python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["torch", "open-clip-torch", "dreamdb-dataset"]
# ///
import open_clip, torch, dreamdb_dataset as vd

# Encode a text query with CLIP's text tower.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
with torch.no_grad():
    q = model.encode_text(tokenizer(["a sunset"]))
    q = (q / q.norm(dim=-1, keepdim=True))[0].numpy()

ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
batches = ds.iter_vector(field="embedding", query=q.tolist(), top_k=10)
for batch in batches:
    for anchor, label in zip(batch["_time_anchors"], batch["label"]):
        print(f"  hit: anchor={anchor} label={label}")

CLIP image and text embeddings share a space, so a text query finds visually-matching images.
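
What the index approximates is a cosine-similarity ranking between the query vector and every stored embedding. The brute-force sketch below shows that ranking on three made-up 3-dimensional vectors; the labels and numbers are invented for illustration, and real CLIP vectors have 512 dimensions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, items, k):
    """Return the k (score, label) pairs with the highest cosine similarity.

    `items` is a list of (label, vector) pairs.
    """
    q = normalize(query)
    scored = []
    for label, vec in items:
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


images = [
    ("sunset.jpg", [0.9, 0.1, 0.0]),
    ("cat.jpg", [0.0, 1.0, 0.2]),
    ("beach.jpg", [0.7, 0.0, 0.7]),
]
hits = top_k([1.0, 0.0, 0.0], images, k=2)
print([label for _, label in hits])
```

This exact scan is linear in the dataset size, which is why an approximate index is used for anything beyond toy scale.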

8. Iterate as a PyTorch DataLoader

python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["torch", "pyarrow", "numpy", "dreamdb-dataset"]
# ///
import dreamdb_dataset as vd
from dreamdb_dataset.torch import DreamDBIterableDataset
from torch.utils.data import DataLoader

ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
torch_ds = DreamDBIterableDataset(ds, batch_size=32, fields=["embedding", "label"])
loader = DataLoader(torch_ds, batch_size=None, num_workers=0)

for batch in loader:
    # `batch` is a pyarrow.RecordBatch with one column per requested field
    # plus `_anchor`. Convert as needed for your training loop.
    embeddings = batch.column("embedding").to_numpy(zero_copy_only=False)
    print(f"batch shape: {embeddings.shape}")
    break

DreamDBIterableDataset is a torch.utils.data.IterableDataset over the dataset's Arrow batches. Multi-worker support partitions batches by index across workers.
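
One common way to partition batches by index is round-robin assignment, where worker `w` of `n` reads batches `w, w+n, w+2n, ...`. The sketch below assumes that scheme; the text above only says the partition is "by index", so treat the exact rule as hypothetical.

```python
def batches_for_worker(num_batches, worker_id, num_workers):
    """Return the batch indices one worker reads under round-robin assignment."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return list(range(worker_id, num_batches, num_workers))


# 200 samples at batch_size=32 make 7 batches (the last one is short).
num_batches = -(-200 // 32)  # ceiling division
for w in range(2):
    print(w, batches_for_worker(num_batches, w, 2))
```

Whatever the rule, the property that matters for training is that the workers' index sets are disjoint and together cover every batch exactly once.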

9. Time-travel — open a prior snapshot

python
import dreamdb_dataset as vd

# Open the dataset at the "baseline-200" snapshot specifically.
# A snapshot label is itself a ref name, so pass the `label` value that
# `ds.snapshot(...)` returned in step 6 straight to `open_ref`.
old = vd.Dataset.open_ref("baseline-200", backend="http://localhost:9000/tutorial")
print(f"manifest at baseline-200: {old.current_manifest()}")

The "live" multimodal ref might have moved on (more appends, ada-ivf rebuilds, deletes), but the baseline-200 snapshot is byte-identical to what was there at snapshot time — every Object it transitively references is content-addressed and immutable.

10. Visualize in a browser

bash
# Serve the static browser UI (no build step — it's plain JS).
cd dreamdb-dataset-python/examples/web
python -m http.server 8080

Open http://localhost:8080/browse.html in Chrome. Point it at the backend URL http://localhost:9000/tutorial and the ref name multimodal. You'll see:

  • A scrollable grid of the 200 images you appended.
  • A search box that runs semantic queries against your CLIP embeddings, client-side.
  • A "⏳ history" button that lists every Manifest in the ref's history; click one to re-render the entire view at that point in time. (This is the "killer feature" — the same dataset, at any past moment, with one click.)

11. Operator maintenance

Two single-machine ops every DreamDB operator runs.

11a. Index health

bash
./target/release/dreamdb ada-ivf-status \
  --backend http://localhost:9000/tutorial \
  --timeline <paste-from-current_manifest-output> \
  --modality embedding.f32.dim=512.bucketed

Reports per-cell record counts, the imbalance score, and a MONITOR / SPLIT / MERGE recommendation. At 200 records this will be MONITOR (k=√200 ≈ 14 cells, ~14 records each, nothing to do).
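
The cell-count arithmetic above can be checked in a few lines. In this sketch `ivf_cell_estimate` applies the k ≈ √n rule quoted in the text, while `imbalance` is a hypothetical score (largest cell over mean cell size) included only to show what such a metric could look like; the CLI's actual formula is not described here.

```python
import math


def ivf_cell_estimate(num_records):
    """Return (k, mean records per cell) for the k = round(sqrt(n)) rule of thumb."""
    k = max(1, round(math.sqrt(num_records)))
    return k, num_records / k


def imbalance(cell_counts):
    """HYPOTHETICAL score: largest cell divided by the mean cell size (1.0 is perfectly even)."""
    mean = sum(cell_counts) / len(cell_counts)
    return max(cell_counts) / mean


k, per_cell = ivf_cell_estimate(200)
print(k, round(per_cell, 1))
```

For 200 records this gives 14 cells with about 14.3 records each, matching the figures above. The same rule at 10 billion records gives 100,000 cells of 100,000 records.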

11b. Garbage collection

bash
./target/release/dreamdb gc \
  --backend http://localhost:9000/tutorial \
  --keep-manifests 100 \
  --keep-since 24h \
  --dry-run

Reports how many Objects would be reclaimed. At billion-scale, run with --shard N --of M across multiple k8s pods.
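
A dry run of this kind is essentially a mark-and-sweep over the Manifest history. The sketch below assumes a Manifest is retained if it is among the newest `keep_manifests` or younger than `keep_since`, and that any Object reachable from a retained Manifest survives. How the real `gc` combines the two flags is not stated above, so the union rule and every name here are assumptions.

```python
def plan_gc(manifests, objects, keep_manifests, keep_since, now):
    """Return the set of object ids a dry run would report as reclaimable.

    manifests: list of (manifest_id, created_at_seconds, set_of_object_ids)
    objects:   set of every object id in the store
    """
    newest_first = sorted(manifests, key=lambda m: m[1], reverse=True)
    # Mark: retain the newest N manifests plus anything inside the time window.
    kept = {m[0] for m in newest_first[:keep_manifests]}
    kept |= {m[0] for m in manifests if now - m[1] <= keep_since}
    reachable = set()
    for manifest_id, _, refs in manifests:
        if manifest_id in kept:
            reachable |= refs
    # Sweep: everything not reachable from a retained manifest.
    return objects - reachable


history = [("m1", 0, {"a"}), ("m2", 100, {"a", "b"}), ("m3", 200, {"b", "c"})]
store = {"a", "b", "c", "d"}
print(sorted(plan_gc(history, store, keep_manifests=1, keep_since=50, now=200)))
```

With only the newest Manifest retained, objects `a` and `d` are reclaimable; with `keep_manifests=100` only the orphan `d` is.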

12. Delete a record (tombstone)

GDPR-style suppression: read paths skip the named anchor; storage compaction is a separate operator pass (deferred per spec/0020 §6).

bash
# Pick an anchor from step 7's query output — e.g. 1779090000000000456.
./target/release/dreamdb delete \
  --backend http://localhost:9000/tutorial \
  --ref-name multimodal \
  --reason gdpr \
  1779090000000000456

Or from Python:

python
import dreamdb_dataset as vd
BACKEND = "http://localhost:9000/tutorial"
ds = vd.Dataset.open_ref("multimodal", backend=BACKEND)
new_hash = ds.delete([1779090000000000456], reason="gdpr")
print("tombstone set now:", ds.tombstone_set())

Now re-run the semantic query from step 7. The deleted anchor is gone. But re-run it against the baseline-200 snapshot — the record is still there, because that snapshot pins the Manifest from before the delete:

python
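# Reuses `vd`, `BACKEND`, and the CLIP text-query vector `q` from step 7.
old = vd.Dataset.open_ref("baseline-200", backend=BACKEND)
batches = old.iter_vector(field="embedding", query=q.tolist(), top_k=10)
for batch in batches:
    print(batch["_time_anchors"])  # the deleted anchor still appears here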

This is DreamDB's time-travel property in action: deletion doesn't rewrite history, it adds a Manifest with a dreamdb.tombstones registry entry. Old snapshots that don't reference the tombstone see the original data.
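
The read-path behaviour can be modelled with two toy manifests, one from before the delete and one after. This is a sketch of the idea only: the dict layout, the `tombstones` key, and `visible_anchors` are illustrative stand-ins, not the real Manifest or registry format.

```python
def visible_anchors(manifest, records):
    """Return the anchors a reader sees at `manifest`, skipping tombstoned ones."""
    tombstones = set(manifest.get("tombstones", ()))
    return [a for a in manifest["anchors"] if a in records and a not in tombstones]


# The stored records are never rewritten by a delete.
records = {101: "sunset.jpg", 102: "cat.jpg", 103: "beach.jpg"}

baseline = {"anchors": [101, 102, 103]}                            # pinned by the snapshot
after_delete = {"anchors": [101, 102, 103], "tombstones": [102]}   # new tip of the live ref

print(visible_anchors(after_delete, records))
print(visible_anchors(baseline, records))
```

The live ref hides anchor 102 while the older manifest still returns it, and `records` is identical in both cases.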

13. Sharded ingest with merge-many

For real-scale ingest, run N workers each ingesting into their own branch, then merge them all into trunk in one shot.

Worker 0 -- branch and append its slice:

python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///
import numpy as np, dreamdb_dataset as vd
ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
w = ds.branch("ingest-w-0")
rng = np.random.default_rng(0)
w.append_many([{"image": b"fake-jpeg-0", "embedding": rng.standard_normal(512).astype(np.float32),
                "label": f"w0-{i}"} for i in range(50)])

Worker 1 -- same, different slice:

python
#!/usr/bin/env -S uv run
# /// script
# dependencies = ["numpy", "dreamdb-dataset"]
# ///
import numpy as np, dreamdb_dataset as vd
ds = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
w = ds.branch("ingest-w-1")
rng = np.random.default_rng(1)
w.append_many([{"image": b"fake-jpeg-1", "embedding": rng.standard_normal(512).astype(np.float32),
                "label": f"w1-{i}"} for i in range(50)])

Orchestrator merges both branches into trunk in one shot -- either via CLI:

bash
./target/release/dreamdb merge-many \
  --backend http://localhost:9000/tutorial \
  --ref-name multimodal \
  ingest-w-0 ingest-w-1

...or from Python:

python
import dreamdb_dataset as vd
trunk = vd.Dataset.open_ref("multimodal", backend="http://localhost:9000/tutorial")
final_manifest = trunk.merge_many(["ingest-w-0", "ingest-w-1"])
print(f"new trunk manifest: {final_manifest}")

The orchestrator publishes a new Manifest with parents = [trunk_tip, w0_tip, w1_tip] (chained sequentially in v0). At 10B scale you'd run 64 workers in parallel against a k8s Job; the merge-many step takes ~minutes regardless. See design/0007-sharded-ingest.md for the algorithm and the k8s YAML pattern.
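
Before the workers start, the orchestrator has to decide which inputs each branch gets. A minimal sketch of one option, contiguous near-equal slices keyed by the `ingest-w-N` branch names used above, follows; `shard_slices` is a hypothetical helper, and design/0007 may prescribe a different split.

```python
def shard_slices(items, num_workers):
    """Split items into num_workers contiguous, near-equal slices keyed by branch name."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    base, extra = divmod(len(items), num_workers)
    plan, start = {}, 0
    for w in range(num_workers):
        # The first `extra` workers take one additional item each.
        end = start + base + (1 if w < extra else 0)
        plan[f"ingest-w-{w}"] = items[start:end]
        start = end
    return plan


plan = shard_slices([f"img-{i}.jpg" for i in range(5)], 2)
for branch, files in plan.items():
    print(branch, files)
```

The keys of `plan` are exactly the branch names you would later pass to `merge-many`.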

14. Moving to S3 (optional)

When your data outgrows the local MinIO on your laptop disk, the path to production is just env vars, with no code changes:

bash
export AWS_ACCESS_KEY_ID="AKIA…"
export AWS_SECRET_ACCESS_KEY="…"
export AWS_REGION="us-east-1"

# Same Python / CLI commands as above; just point at the S3 bucket.
python -c 'import dreamdb_dataset as vd; print(vd.Dataset.open_ref("multimodal", backend="https://s3.us-east-1.amazonaws.com/my-bucket").count())'

The connector auto-detects the env vars and signs every request with SigV4. Works against AWS S3, Cloudflare R2, Backblaze B2, Wasabi, and any other S3-compatible endpoint. For R2, set AWS_REGION=auto and use your R2 endpoint URL.
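
Environment-based credential detection usually amounts to a small lookup with a fallback to unsigned requests. The sketch below shows that pattern under stated assumptions: the fallback to anonymous access and the default region are guesses, and `detect_s3_credentials` is an illustrative name, not the connector's API.

```python
import os

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def detect_s3_credentials(env=None):
    """Return a credentials dict if both keys are set, else None (anonymous access)."""
    env = os.environ if env is None else env
    if not all(env.get(name) for name in REQUIRED):
        return None
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "region": env.get("AWS_REGION", "us-east-1"),  # ASSUMED default region
    }


# Passing a dict instead of os.environ keeps the example side-effect free.
print(detect_s3_credentials({"AWS_ACCESS_KEY_ID": "id", "AWS_SECRET_ACCESS_KEY": "key", "AWS_REGION": "auto"}))
```

A `None` result corresponds to the unsigned requests used against the public tutorial bucket in step 2.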

15. What now?

You've used the full DreamDB surface. Here's where to read next:

  • For the protocol: spec/0000-overview.md is the entry point; INDEX.md is the navigation index for all 21 specs.
  • For real ingest at scale: design/0006-10b-scale-blockers.md (everything that makes DreamDB 10B-ready).
  • For sharded ingest mechanics: design/0007-sharded-ingest.md.
  • For the Rust SDK: start at dreamdb-dataset/src/dataset.rs; the public surface is small and well-documented.
  • For operator workflows: dreamdb --help lists every CLI verb.

Cleanup

bash
# -v also removes any anonymous volumes attached to the container.
docker rm -f -v dreamdb-tutorial-minio

The tutorial bucket and all 200 samples + 1000 random vectors are gone. DreamDB itself is just files on disk; nothing else persists.

Status notes

This tutorial reflects the v0 reference implementation as of 2026-05-18. Two things to know:

  • iter_arrow_batches materializes the whole dataset into memory before yielding. For 10B-scale Python streaming, use Dataset.iter_stream(batch_size, fields) instead — it returns a true generator (StreamBatchIter) backed by a Rust mpsc channel, with bounded per-batch RAM. Embedding-only in v0; multi-modality merge-join is the next streaming-iter slice.
  • CLIP-ViT-B/32 on M-series MPS runs at ~200-300 samples/sec — that's the GPU forward-pass limit, not a DreamDB bottleneck. Production ingest uses sharded workers (step 13).
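
The bounded-memory streaming pattern in the first note can be sketched in pure Python with a producer thread and a size-limited queue, which plays the role of the bounded channel. This illustrates the pattern only; `stream_batches` is a hypothetical stand-in and not how `Dataset.iter_stream` is implemented.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stream_batches(source, batch_size, max_buffered=2):
    """Yield lists of up to batch_size items, with at most max_buffered batches queued."""
    channel = queue.Queue(maxsize=max_buffered)

    def producer():
        batch = []
        for item in source:
            batch.append(item)
            if len(batch) == batch_size:
                channel.put(batch)  # blocks while the consumer is behind
                batch = []
        if batch:
            channel.put(batch)      # flush the short final batch
        channel.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = channel.get()
        if batch is _DONE:
            return
        yield batch


for batch in stream_batches(range(10), batch_size=4):
    print(batch)
```

Memory stays bounded because the producer blocks once the queue is full, so the number of batches alive at any moment does not depend on the dataset size. A consumer that stops early leaves the daemon thread blocked until the process exits, which a production version would handle with an explicit cancel signal.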