Design 0001 — Dataset Platform on DreamDB
Status: Implementation shipping. Phases 1–5 (10B-scale blocker push) complete as of 2026-05-18. See design/0006-10b-scale-blockers.md for the post-implementation summary; this doc is preserved as the original Phase-0 architecture sketch.
Last updated: 2026-05-11 (original architecture sketch); status banner refreshed 2026-05-18.
Owner: —
What's accurate in this doc as of 2026-05-18: the conceptual model (multimodal data lake on DreamDB, ImageNet-100 as reference dataset, Tracks-per-field schema), and the original Phase-0 plan. What's outdated: phase markers in the "Coverage today vs gap" table; most "Phase N closes the gap" rows have shipped. For the current state, read design/0006-10b-scale-blockers.md (B1-B8 all ✅) and the README's status table. Current corpus is 231K imagenet-100 + 1.33M imagenet-1k (ingest in flight).
What we're building
A versioned multimodal data lake for ML training, with DreamDB as the storage substrate.
Users upload raw datasets (images, audio, text, embeddings, scalar labels) to a Dataset. They fetch subsets matching arbitrary filters — vector similarity, time/version range, structured metadata, random/stratified samples — into a streaming PyTorch DataLoader.
Comparable products: Activeloop Deep Lake, HuggingFace Datasets, Pachyderm, DVC. The differentiator is DreamDB underneath: content-addressed storage gives free dedup + branching; the same protocol handles every modality on one timeline; the same query path works at 1M and 1B vectors.
Reference dataset
ImageNet-100 — 100-class subset of ImageNet-1K. ~130K train + ~5K val images, JPEG, ~13 GB total. Provides:
- Image blobs (variable size, KB-MB each): tests Fragment/blob storage.
- Per-image categorical label (class): tests scalar metadata.
- Train/val split: tests another categorical filter.
- Source: clane9/imagenet-100 on HuggingFace (downloading in parallel to this doc).
We'll generate embeddings as a separate offline step (default: pretrained ResNet-50 via the Hugging Face transformers Python lib) and store them as a parallel field in the Dataset. We do NOT ship an embedding model in the SDK; embedding generation is the user's responsibility.
Coverage today vs gap
| Need | Today in DreamDB | Phase that closes the gap |
|---|---|---|
| Multimodal storage on one timeline | ✓ Tracks (Continuous Signal / Discrete Event / Global Constant per spec/0001) | — |
| Content-addressed blobs (images, audio) | ✓ Fragment Tracks (spec/0007 §4) | — |
| Vector similarity filter | ✓ dreamdb.lsh-cosine / dreamdb.ivf-cosine / dreamdb.imi-cosine | — |
| Time-range / dataset-version filter | ✓ Manifest DAG + Refs | High-level wrapper in Phase 1 |
| Structured metadata filter (WHERE label='cat') | ✗ No scalar-index modality | Phase 2 (this is the spec contribution) |
| Random / stratified sample | ✗ No primitive | Phase 4 (built on top of enumerate APIs) |
| Python integration (PyTorch / JAX / TF) | ✗ Rust-only | Phase 3 (PyO3 bindings) |
| Bulk upload of large blobs | Partial — Fragments work, no multipart | Phase 4 (multipart on the connector) |
Phasing
| Phase | Deliverable | Wall-time est. |
|---|---|---|
| 0 | This doc + ImageNet-100 download. | a few hours |
| 1 | dreamdb-dataset Rust crate: Dataset::create / open / append / iter. Reference CLI app: ingest ImageNet-100 (no metadata filter yet); search by vector similarity; stream batches. Validates the high-level SDK shape. | ~1 week |
| 2 | Native scalar-index modality: spec/0011 + protocol implementation + bench validation. Enables WHERE label=... in Dataset::iter. | ~2 weeks |
| 3 | Python bindings via PyO3. Module structure mirrors Rust SDK; IterableDataset adapter for PyTorch DataLoader; multi-worker shard-deterministic iteration. | ~1 week |
| 4 | Multipart upload, random/stratified sampling primitives, distributed sharding for multi-worker DataLoader. | ~1-2 weeks |
Total: ~5-6 weeks engineering for the full product. The Rust SDK is usable end-to-end after Phase 2; Python after Phase 3.
Phase 1: SDK shape (Rust)
Crate layout
Public API sketch
Schema and field-to-Track mapping
Each Field type maps to a DreamDB Track:
- Image / Audio → Fragment Track (one Fragment per blob, addressed by content hash → free dedup).
- Text → Discrete Event Track (small payloads, time-bucketed).
- Embedding → Spatial-Bucket Track with the embedding's algorithm in the registry.
- Scalar → Scalar-Index Track (Phase 2's new modality).
A Sample is a tuple of refs across these per-field Tracks, joined by a sample id (a u64 we mint on append). The sample-id → per-field-Object mapping lives in a per-Dataset "join Track" (probably another Discrete Event Track keyed by sample id).
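To make the join model concrete, here is a minimal in-memory sketch of the "join Track" option. Everything in it is hypothetical: `ToyDataset` and its methods are stand-ins and not the dreamdb-dataset API, plain dicts play the role of Tracks, and SHA-256 is an assumed content hash (this doc does not name one).

```python
import hashlib


class ToyDataset:
    """In-memory stand-in for a Dataset. Hypothetical; not the dreamdb-dataset API."""

    def __init__(self):
        self.fragments = {}  # content hash -> blob bytes (plays the Fragment Track)
        self.labels = {}     # sample_id -> label (plays the Scalar-Index Track)
        self.join = {}       # sample_id -> {field: ref} (plays the join Track)
        self._next_id = 0    # sample ids are minted on append

    def append(self, image: bytes, label: str) -> int:
        sample_id = self._next_id
        self._next_id += 1
        digest = hashlib.sha256(image).hexdigest()  # assumption: SHA-256 as content hash
        self.fragments.setdefault(digest, image)    # identical blobs are stored once
        self.labels[sample_id] = label
        self.join[sample_id] = {"image": digest, "label": sample_id}
        return sample_id

    def get(self, sample_id: int) -> dict:
        # Resolve each per-field ref through the join record.
        refs = self.join[sample_id]
        return {
            "image": self.fragments[refs["image"]],
            "label": self.labels[refs["label"]],
        }


ds = ToyDataset()
first = ds.append(b"same-jpeg-bytes", "cat")
second = ds.append(b"same-jpeg-bytes", "dog")
print(len(ds.join), len(ds.fragments))  # two samples, one stored blob
```

Because each sample owns its own join record, a per-sample update only rewrites that record, which is the property the shared-time-anchor alternative would give up.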
Filter execution
iter(filter, batch_size):
- Decompose the filter AST: identify which clauses are index-amenable (Vector, Where, TimeRange) vs. requires-full-scan (catch-all).
- Execute each index lookup against its Track in parallel; intersect the resulting sample-id sets.
- For each sample id in the intersection, fetch the requested fields from their Tracks (parallel BucketReference resolution).
- Yield in batches of batch_size, with optional shuffle (deterministic from a (epoch, worker_id) seed for multi-worker iteration).
The filter planner is the conceptual heart of this crate. Phase 1 ships a no-op planner that requires the user to express filters in a single-clause shape (e.g. just a vector query, or just a time range). Phase 2 adds intersection. Phase 4 adds the sampling primitives.
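The intersect-then-batch core of the steps above can be sketched in a few lines. This is an illustration under assumptions, not the planner: `iter_batches` is a hypothetical name, each index lookup is assumed to have already produced a set of sample ids, and the full-scan catch-all and the field fetch are left out.

```python
import random
from functools import reduce


def iter_batches(index_hits, batch_size, epoch=0, worker_id=0, shuffle=False):
    """Yield batches of sample ids matching every index-amenable clause.

    index_hits: one set of sample ids per clause (hypothetical input shape).
    """
    if not index_hits:
        raise ValueError("at least one clause result is required")
    # Intersect the per-clause results, then sort for a stable base order.
    ids = sorted(reduce(set.intersection, index_hits))
    if shuffle:
        # A string seed built from (epoch, worker_id) gives a repeatable order.
        random.Random(f"{epoch}:{worker_id}").shuffle(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


vector_hits = {1, 2, 3, 4, 5}      # e.g. nearest neighbours of a query vector
where_hits = {2, 3, 5, 8}          # e.g. label = 'cat'
time_range_hits = {0, 2, 3, 5, 9}  # e.g. a dataset-version range
print(list(iter_batches([vector_hits, where_hits, time_range_hits], batch_size=2)))
# [[2, 3], [5]]
```

Sorting before the optional shuffle keeps the result independent of set iteration order, which matters once several workers must agree on the same list.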
Phase 2: Native scalar-index modality (the spec contribution)
This is the only piece of DreamDB that doesn't have an obvious existing extension path. The clean answer: treat scalar fields as their own Track-with-spatial-index pair, parallel to vector tracks.
Sketch
A new modality string: scalar.<value-type> (e.g. scalar.string-categorical, scalar.int64, scalar.timestamp).
A new SpatialIndex algorithm family, although "spatial" is the wrong word here: this is a 1-D scalar index. So either we generalize the SpatialIndex Object to "IndexObject," or we add a new sibling concept "ScalarIndexObject."
Cleanest: add a new Track index variant ObjectIndex::ScalarBucket(_), parallel to SpatialBucket. The bucket records carry (scalar_value, sample_id, time_anchor) tuples sorted by scalar_value. Lookups by value range descend the same B-tree of Index Pages we already use for paged tracks.
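As a rough illustration of the lookup shape, the sketch below keeps (scalar_value, sample_id, time_anchor) records sorted by value and answers a range query with two binary searches. It is a flat, single-leaf stand-in: `ToyScalarBucket` is a hypothetical name, and the real design would descend Index Pages rather than search one in-memory list.

```python
from bisect import bisect_left, bisect_right


class ToyScalarBucket:
    """Sorted (scalar_value, sample_id, time_anchor) records. Hypothetical stand-in."""

    def __init__(self, records):
        # Tuples sort by scalar_value first, then sample_id, then time_anchor.
        self.records = sorted(records)
        self.keys = [value for value, _, _ in self.records]

    def range(self, lo, hi):
        """Return sample ids whose value satisfies lo <= value <= hi."""
        start = bisect_left(self.keys, lo)
        stop = bisect_right(self.keys, hi)
        return [sample_id for _, sample_id, _ in self.records[start:stop]]


bucket = ToyScalarBucket([(30, 7, 100), (10, 1, 100), (20, 4, 101), (20, 2, 102)])
print(bucket.range(15, 25))  # [2, 4]
```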
Three algorithm flavors to ship:
| Algorithm ID | Use case | Storage | Lookup cost |
|---|---|---|---|
| dreamdb.btree-int64 | Integer / timestamp ranges | Sorted (value, sample_id) pairs in leaves | O(log N) |
| dreamdb.btree-string | String categorical / lexical | Sorted (value, sample_id) pairs | O(log N) |
| dreamdb.bitmap-categorical | Low-cardinality categorical (label, split) with very common in-set queries | Roaring bitmap per category value | O(cardinality) for index, O(1) per match |
Bitmap is the obvious choice for ImageNet-100's label (100 distinct values, queries like label='cat' resolve to "AND the bitmap for 'cat' with everything else"). B-tree handles wider ranges.
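The bitmap path can be sketched with Python integers used as bitsets. This swaps in plain ints for Roaring bitmaps purely to stay dependency-free; `ToyBitmapIndex` and `ids_of` are hypothetical names, and the AND below is the "intersect with everything else" step for a two-clause filter.

```python
from collections import defaultdict


class ToyBitmapIndex:
    """One bitset per category value. Python ints stand in for Roaring bitmaps."""

    def __init__(self):
        self.bitmaps = defaultdict(int)  # category value -> int used as a bitset

    def add(self, sample_id, value):
        self.bitmaps[value] |= 1 << sample_id  # set the bit for this sample

    def lookup(self, value):
        return self.bitmaps.get(value, 0)  # unknown categories match nothing


def ids_of(bitmap):
    """Expand a bitset into a sorted list of sample ids."""
    ids, position = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


label, split = ToyBitmapIndex(), ToyBitmapIndex()
rows = [("cat", "train"), ("dog", "train"), ("cat", "val"), ("cat", "train")]
for sample_id, (lab, spl) in enumerate(rows):
    label.add(sample_id, lab)
    split.add(sample_id, spl)

# WHERE label='cat' AND split='train' becomes a single bitwise AND.
print(ids_of(label.lookup("cat") & split.lookup("train")))  # [0, 3]
```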
Open spec questions to resolve in Phase 2:
- How are scalar values written? Inline in the bucket (like vector data) or referenced (like the VS Object pattern)?
- Multi-version semantics: when a sample is overwritten, does the scalar index hold both? DreamDB's append-only semantics suggest yes — all versions are queryable, default reader shows latest.
- Cardinality threshold for bitmap-vs-btree auto-selection.
This work lands as spec/0011-scalar-indexing.md and corresponding code in dreamdb-protocol/src/scalar_index.rs + new BucketRecord variants. Should follow the same template as our IVF/IMI work — algorithm + tests + spec section + bench validation.
Phase 3: Python bindings
PyO3-based Python module. Mirror the Rust API one-to-one where possible:
The as_iterable_dataset adapter is what makes DreamDB usable in real training scripts. It implements PyTorch's IterableDataset, with shard-deterministic iteration so that num_workers=N divides the filtered set into N disjoint streams (no worker sees the same sample twice within an epoch).
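The disjoint-stream property can be shown without PyTorch. The sketch below is one possible scheme under assumptions, not the adapter itself: `worker_stream` is a hypothetical helper, the filtered set is assumed to resolve to the same sorted sample-id list on every worker, and in a real IterableDataset the worker id and worker count would typically come from torch.utils.data.get_worker_info().

```python
import random


def worker_stream(sorted_ids, epoch, worker_id, num_workers):
    """Return this worker's sample ids for one epoch (hypothetical helper)."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # Strided slicing over a shared sorted list yields disjoint shards.
    shard = list(sorted_ids[worker_id::num_workers])
    # Shuffle inside the shard only, seeded by (epoch, worker_id).
    random.Random(f"{epoch}:{worker_id}").shuffle(shard)
    return shard


ids = list(range(10))
streams = [worker_stream(ids, epoch=0, worker_id=w, num_workers=3) for w in range(3)]
seen = sorted(sample for stream in streams for sample in stream)
print(seen == ids)  # True: every sample appears exactly once across workers
```

Shuffling after sharding keeps shard membership fixed across epochs. If samples should move between workers from epoch to epoch, one alternative is an epoch-seeded global shuffle before the strided slice.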
Phase 4: Polish
- Multipart upload: large videos / audio files exceed S3's 5 GB single-PUT cap. The connector layer needs start_multipart, upload_part, complete_multipart. Already mentioned in spec/0005 as future work.
- Random sampling: Filter::RandomSample { count, seed } translates into a stream of sample-id picks via reservoir sampling over the scalar-index leaves.
- Stratified sampling: same but bucketed by a categorical field; pulls count / num_strata from each bucket.
- Distributed sharding: when num_workers > 1, each worker only iterates over its assigned shard of the filtered sample-id set. Determinism requires the filter-evaluation order to be stable across workers; this is straightforward if the filter resolves to a sorted sample-id list.
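The two sampling items above can be sketched with the classic reservoir scheme (Algorithm R). This is a sketch under assumptions: the function names are hypothetical, any iterable of sample ids stands in for the scalar-index leaves, and integer division is assumed for count / num_strata.

```python
import random


def reservoir_sample(stream, count, seed):
    """Uniformly sample up to `count` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream):
        if n < count:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, n)    # inclusive on both ends
            if j < count:
                reservoir[j] = item  # replace with probability count / (n + 1)
    return reservoir


def stratified_sample(buckets, count, seed):
    """Draw count // num_strata ids from each category bucket."""
    per_stratum = count // len(buckets)
    return {
        name: reservoir_sample(buckets[name], per_stratum, f"{seed}:{name}")
        for name in sorted(buckets)  # sorted for a stable evaluation order
    }


picks = reservoir_sample(range(1000), count=10, seed=42)
strata = stratified_sample({"cat": range(0, 50), "dog": range(50, 100)}, count=10, seed=42)
print(len(picks), {name: len(ids) for name, ids in strata.items()})
```

The same seed always yields the same picks, so a sampled subset can be reproduced as long as the underlying sample-id stream arrives in a stable order.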
Risks and open questions
| Risk | Notes |
|---|---|
| Embedding generation is out of scope but every real dataset needs it. | Document the recommended pattern (run a separate dreamdb-dataset embed --model resnet50 step before upload). Don't bundle the model. |
| Phase 2's scalar-index is a real spec contribution; could expand to its own multi-week effort. | Start with bitmap-only (the ImageNet-100 case); B-tree comes after. |
| Python multi-worker DataLoader semantics are subtle (epoch boundaries, shuffling determinism, worker re-seeding). | Crib from HuggingFace datasets and PyTorch IterableDataset examples. The Rust side just needs to expose enough primitives (sharded iteration with deterministic seeds) for the Python layer to compose them. |
| PyO3 wraps Rust async into Python sync clumsily. | Likely solution: each Python iter() call holds a Tokio runtime internally and block_ons. Avoids leaking async into Python. |
| Versioning UX: do users see dataset@v1.2 or dataset@<commit-hash>? | Default to dataset@v1.2 (snapshots are user-named labels mapped to Manifest hashes). Hashes always work as a fallback. |
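For the versioning row, the proposed resolution order (user-named label first, raw hash as fallback) might look like the sketch below. All names and values are hypothetical: `resolve_ref`, the shape of the label table, and the short hash strings are illustrative only.

```python
def resolve_ref(ref, labels, manifests):
    """Resolve 'dataset@version' to a Manifest hash (hypothetical helper).

    labels: dataset name -> {label: manifest hash}
    manifests: set of known Manifest hashes
    """
    name, sep, version = ref.partition("@")
    if not sep or not version:
        raise ValueError(f"expected 'dataset@version', got {ref!r}")
    named = labels.get(name, {})
    if version in named:
        return named[version]  # user-named snapshot label
    if version in manifests:
        return version         # raw hash always works as a fallback
    raise KeyError(ref)


labels = {"imagenet-100": {"v1.2": "abc123"}}
manifests = {"abc123", "def456"}
print(resolve_ref("imagenet-100@v1.2", labels, manifests))    # abc123
print(resolve_ref("imagenet-100@def456", labels, manifests))  # def456
```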
Decision log
| Decision | Choice | Why |
|---|---|---|
| Reference dataset | ImageNet-100 | Real scale, real metadata, common ML benchmark |
| Scalar-metadata path | Native (option 1) | Long-arc consistency; keeps the data plane single-system |
| Python on the critical path? | Yes | PyTorch DataLoader is non-negotiable for ML adoption |
| Embedding generation in SDK? | No (recommend external step) | Keeps the SDK focused; embedding choices vary by user |
Next concrete step
Once the ImageNet-100 download finishes (in flight as of this doc), Phase 1 can begin: scaffold dreamdb-dataset crate with the API skeleton above, and write a CLI app that ingests ImageNet-100 train split + queries by vector similarity. No metadata filter yet — that's Phase 2.
Phase 1 is also the moment to decide if Sample joins via a separate "join Track" (sample-id ↔ per-field references) or via shared time anchors across Tracks. The latter is simpler but locks us out of per-sample updates. Worth deciding before writing too much code.