Design 0001 — Dataset Platform on DreamDB

Status: Implementation shipping. Phases 1–5 (10B-scale blocker push) complete as of 2026-05-18. See design/0006-10b-scale-blockers.md for the post-implementation summary; this doc is preserved as the original Phase-0 architecture sketch. Last updated: 2026-05-11 (original architecture sketch); status banner refreshed 2026-05-18. Owner: —

What's accurate in this doc as of 2026-05-18: the conceptual model (multimodal data lake on DreamDB, ImageNet-100 as reference dataset, Tracks-per-field schema), and the original Phase-0 plan. What's outdated: phase markers in the "Coverage today vs gap" table — most "Phase N closes the gap" rows have shipped. For the current state, read design/0006-10b-scale-blockers.md (B1-B8 all ✅) and the README's status table. Current corpus is 231K imagenet-100 + 1.33M imagenet-1k (ingest in flight).

What we're building

A versioned multimodal data lake for ML training, with DreamDB as the storage substrate.

Users upload raw datasets (images, audio, text, embeddings, scalar labels) to a Dataset. They fetch subsets matching arbitrary filters — vector similarity, time/version range, structured metadata, random/stratified samples — into a streaming PyTorch DataLoader.

Comparable products: Activeloop Deep Lake, HuggingFace Datasets, Pachyderm, DVC. The differentiator is DreamDB underneath: content-addressed storage gives free dedup + branching; the same protocol handles every modality on one timeline; the same query path works at 1M and 1B vectors.
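
To make the dedup and branching claim concrete, here is a minimal sketch of a content-addressed store. It is not DreamDB's implementation: `BlobStore`, `put`, `commit`, and `branch` are hypothetical names, SHA-256 stands in for whatever hash DreamDB uses, and a plain list of digests stands in for a Manifest.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: a blob's key is the hash of its bytes."""

    def __init__(self):
        self.blobs = {}  # hex digest -> bytes
        self.refs = {}   # ref name -> list of digests (stand-in for a Manifest)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical bytes hash to the same key, so a re-upload stores nothing new.
        self.blobs.setdefault(digest, data)
        return digest

    def commit(self, ref: str, digests: list) -> None:
        self.refs[ref] = list(digests)

    def branch(self, from_ref: str, new_ref: str) -> None:
        # A branch copies the list of hashes, never the blob bytes.
        self.refs[new_ref] = list(self.refs[from_ref])


store = BlobStore()
a = store.put(b"cat-jpeg-bytes")
b = store.put(b"dog-jpeg-bytes")
c = store.put(b"cat-jpeg-bytes")  # duplicate upload
store.commit("main", [a, b, c])
store.branch("main", "experiment")
```

Three uploads leave two stored blobs, and the branch costs one small list of hashes regardless of how large the blobs are.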

Reference dataset

ImageNet-100 — 100-class subset of ImageNet-1K. ~130K train + ~5K val images, JPEG, ~13 GB total. Provides:

Image blobs (variable size, KB-MB each) — tests Fragment/blob storage.
Per-image categorical label (class) — tests scalar metadata.
Train/val split — tests another categorical filter.
Source: clane9/imagenet-100 on HuggingFace (downloading in parallel to this doc).

We'll generate embeddings as a separate offline step (default: pretrained ResNet-50 via the Hugging Face transformers Python lib) and store them as a parallel field in the Dataset. We do NOT ship an embedding model in the SDK; embedding generation is the user's responsibility.

Coverage today vs gap

Need	Today in DreamDB	Phase that closes the gap
Multimodal storage on one timeline	✓ Tracks (Continuous Signal / Discrete Event / Global Constant per `spec/0001`)	—
Content-addressed blobs (images, audio)	✓ Fragment Tracks (`spec/0007 §4`)	—
Vector similarity filter	✓ `dreamdb.lsh-cosine` / `dreamdb.ivf-cosine` / `dreamdb.imi-cosine`	—
Time-range / dataset-version filter	✓ Manifest DAG + Refs	High-level wrapper in Phase 1
Structured metadata filter (`WHERE label='cat'`)	✗ No scalar-index modality	Phase 2 (this is the spec contribution)
Random / stratified sample	✗ No primitive	Phase 4 (built on top of enumerate APIs)
Python integration (PyTorch / JAX / TF)	✗ Rust-only	Phase 3 (PyO3 bindings)
Bulk upload of large blobs	Partial — Fragments work, no multipart	Phase 4 (multipart on the connector)

Phasing

Phase	Deliverable	Wall-time est.
0	This doc + ImageNet-100 download.	a few hours
1	`dreamdb-dataset` Rust crate: `Dataset::create / open / append / iter`. Reference CLI app: ingest ImageNet-100 (no metadata filter yet); search by vector similarity; stream batches. Validates the high-level SDK shape.	~1 week
2	Native scalar-index modality: spec/0011 + protocol implementation + bench validation. Enables `WHERE label=...` in `Dataset::iter`.	~2 weeks
3	Python bindings via PyO3. Module structure mirrors Rust SDK; `IterableDataset` adapter for PyTorch DataLoader; multi-worker shard-deterministic iteration.	~1 week
4	Multipart upload, random/stratified sampling primitives, distributed sharding for multi-worker DataLoader.	~1-2 weeks

Total: ~5-6 weeks engineering for the full product. The Rust SDK is usable end-to-end after Phase 2; Python after Phase 3.

Phase 1: SDK shape (Rust)

Crate layout

dreamdb-dataset/
├── Cargo.toml
└── src/
    ├── lib.rs          # public Dataset, Sample, Filter, etc.
    ├── builder.rs      # SampleBuilder for staged uploads
    ├── filter.rs       # Filter AST + execution planner
    ├── iter.rs         # batch iterator + shuffle/shard logic
    └── version.rs      # snapshot, branch, version naming on top of Refs

Public API sketch

rust

pub struct Dataset {
    timeline: Multihash,
    ref_name: String,
    session: Arc<Session>,
}

impl Dataset {
    pub async fn create(name: &str, schema: Schema, conn: Arc<dyn Connector>) -> Result<Self>;
    pub async fn open(name: &str, conn: Arc<dyn Connector>) -> Result<Self>;

    /// Return a builder; .field(name, value) per field; .commit() to write.
    pub fn append(&mut self) -> SampleBuilder<'_>;

    /// Bulk path — iterator of Samples, batched & flushed.
    pub async fn append_iter<I: Iterator<Item = Sample>>(&mut self, samples: I) -> Result<u64>;

    /// Stream batches matching a Filter.
    pub fn iter(&self, filter: Filter, batch_size: usize) -> impl Stream<Item = Result<Batch>>;

    /// Pin the current Manifest under a label; subsequent reads at this
    /// version are stable even as the dataset grows.
    pub async fn snapshot(&mut self, label: &str) -> Result<DatasetVersion>;

    /// Branch from a version — new ref pointing at the same Manifest.
    pub async fn branch(&self, from: &DatasetVersion, new_name: &str) -> Result<Dataset>;
}

pub struct Sample {
    pub fields: HashMap<String, Field>,
}

pub enum Field {
    Image(Bytes),                  // JPEG/PNG bytes; modality "image.jpeg"
    Audio(Bytes),                  // WAV/MP3 bytes
    Text(String),
    Embedding { algorithm: String, vector: Vec<f32> },
    Scalar(ScalarValue),           // for structured metadata
}

pub enum ScalarValue {
    Int(i64),
    Float(f64),
    Bool(bool),
    String(String),
    Categorical(String),           // string with the hint that it's an enum
    Timestamp(i64),                // ns since epoch
}

pub enum Filter {
    All,
    And(Vec<Filter>),
    Or(Vec<Filter>),
    Not(Box<Filter>),
    Vector { field: String, query: Vec<f32>, top_k: usize },
    TimeRange { start_ns: i64, end_ns: i64 },
    Where { field: String, op: ScalarOp, value: ScalarValue },
    RandomSample { count: usize, seed: u64 },        // Phase 4
    StratifiedSample { count: usize, by: String, seed: u64 }, // Phase 4
}

pub enum ScalarOp { Eq, Neq, Lt, Lte, Gt, Gte, In, Contains }
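
The staged-upload shape behind `append()` (set fields one at a time, write nothing until `commit()`) can be sketched in a few lines of Python. This is an illustration only: the schema check, the list used as a write sink, and the position-based sample id are assumptions, not the behavior of the real `SampleBuilder`.

```python
class SampleBuilder:
    """Collects fields for one sample; nothing is written until commit()."""

    def __init__(self, schema: dict, sink: list):
        self._schema = schema  # field name -> expected Python type (assumed)
        self._sink = sink      # stand-in for the dataset's write path
        self._fields = {}

    def field(self, name: str, value) -> "SampleBuilder":
        expected = self._schema.get(name)
        if expected is None:
            raise KeyError(f"unknown field: {name}")
        if not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}")
        self._fields[name] = value
        return self  # returning self allows chained .field() calls

    def commit(self) -> int:
        missing = set(self._schema) - set(self._fields)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        self._sink.append(dict(self._fields))
        return len(self._sink) - 1  # toy sample id: position in the sink


schema = {"image": bytes, "label": str}
written = []
sample_id = (
    SampleBuilder(schema, written)
    .field("image", b"\xff\xd8jpeg")
    .field("label", "cat")
    .commit()
)
```

Deferring the write to `commit()` means a half-built sample never reaches storage, which is the main reason to prefer a builder over writing field by field.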

Schema and field-to-Track mapping

Each Field type maps to a DreamDB Track:

Image / Audio → Fragment Track (one Fragment per blob, addressed by content hash → free dedup).
Text → Discrete Event Track (small payloads, time-bucketed).
Embedding → Spatial-Bucket Track with the embedding's algorithm in the registry.
Scalar → Scalar-Index Track (Phase 2's new modality).

A Sample is a tuple of refs across these per-field Tracks, joined by a sample id (a u64 we mint on append). The sample-id → per-field-Object mapping lives in a per-Dataset "join Track" (probably another Discrete Event Track keyed by sample id).
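
A small Python model of the join-Track idea follows. All names are hypothetical, dictionaries stand in for Tracks, and SHA-256 stands in for the content hash; the point is only to show how a minted sample id ties per-field references together.

```python
import hashlib
from itertools import count


class ToyDataset:
    def __init__(self):
        self.fragments = {}  # content hash -> blob bytes (Image / Audio fields)
        self.scalars = {}    # (field name, sample id) -> scalar value
        self.join = {}       # sample id -> {field name: per-field reference}
        self._next_id = count()

    def append(self, image: bytes, label: str) -> int:
        sample_id = next(self._next_id)  # id minted on append
        digest = hashlib.sha256(image).hexdigest()
        self.fragments.setdefault(digest, image)  # dedup by content
        self.scalars[("label", sample_id)] = label
        self.join[sample_id] = {"image": digest, "label": ("label", sample_id)}
        return sample_id

    def get(self, sample_id: int) -> dict:
        refs = self.join[sample_id]
        return {
            "image": self.fragments[refs["image"]],
            "label": self.scalars[refs["label"]],
        }


ds = ToyDataset()
first = ds.append(b"same-bytes", "cat")
second = ds.append(b"same-bytes", "kitten")  # same blob, different label
```

Two samples that share image bytes end up with two join entries but one stored fragment, and each sample can be relabeled without touching the blob.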

Filter execution

iter(filter, batch_size):

Decompose the filter AST: identify which clauses are index-amenable (Vector, Where, TimeRange) and which require a full scan (everything else).
Execute each index lookup against its Track in parallel, then combine the resulting sample-id sets: intersection for And, union for Or, complement against the full id set for Not.
For each sample id in the combined set, fetch the requested fields from their Tracks (parallel BucketReference resolution).
Yield in batches of batch_size, with optional shuffle (deterministic from an (epoch, worker_id) seed for multi-worker iteration).

The filter planner is the conceptual heart of this crate. Phase 1 ships a no-op planner that requires the user to express filters in a single-clause shape (e.g. just a vector query, or just a time range). Phase 2 adds intersection. Phase 4 adds the sampling primitives.
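
The execution steps can be sketched as a small set-algebra evaluator. This is a toy under stated assumptions: the index is a plain dictionary, only equality `Where` clauses are modeled, and treating Or as union and Not as complement is an assumption about how those AST nodes would behave, since the steps above only describe intersection.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Where:
    field: str
    value: str


@dataclass(frozen=True)
class And:
    clauses: tuple


@dataclass(frozen=True)
class Or:
    clauses: tuple


@dataclass(frozen=True)
class Not:
    clause: object


def evaluate(f, index, universe):
    """Resolve a filter to a set of sample ids."""
    if isinstance(f, Where):
        return set(index[f.field].get(f.value, ()))
    if isinstance(f, And):
        result = set(universe)
        for clause in f.clauses:
            result &= evaluate(clause, index, universe)
        return result
    if isinstance(f, Or):
        result = set()
        for clause in f.clauses:
            result |= evaluate(clause, index, universe)
        return result
    if isinstance(f, Not):
        return set(universe) - evaluate(f.clause, index, universe)
    raise TypeError(f"unsupported filter: {f!r}")


def iter_batches(ids, batch_size, seed=None):
    """Yield lists of ids; sorting first makes the optional shuffle reproducible."""
    ordered = sorted(ids)
    if seed is not None:
        random.Random(seed).shuffle(ordered)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]


index = {
    "label": {"cat": [0, 1, 4], "dog": [2, 3]},
    "split": {"train": [0, 1, 2], "val": [3, 4]},
}
universe = range(5)
cat_train = evaluate(And((Where("label", "cat"), Where("split", "train"))), index, universe)
batches = list(iter_batches(cat_train, batch_size=1))
```

Sorting the id set before shuffling is what makes the batch order a function of the seed alone, which matters once several workers need to agree on it.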

Phase 2: Native scalar-index modality (the spec contribution)

This is the only piece of the platform without an obvious extension path in DreamDB today. The clean answer: give each scalar field its own Track plus index, parallel to how vector Tracks pair with a spatial index.

Sketch

A new modality string: scalar.<value-type> (e.g. scalar.string-categorical, scalar.int64, scalar.timestamp).

A new SpatialIndex algorithm family, although "spatial" is the wrong word here: this is a 1-D scalar index. So either we generalize the SpatialIndex Object to "IndexObject," or we add a new sibling concept "ScalarIndexObject."

Cleanest: add a new Track index variant ObjectIndex::ScalarBucket(_), parallel to SpatialBucket. The bucket records carry (scalar_value, sample_id, time_anchor) tuples sorted by scalar_value. Lookups by value range descend the same B-tree of Index Pages we already use for paged tracks.
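
The value-range lookup over sorted records can be illustrated with a flat sorted list and binary search. This is a stand-in for the paged B-tree, not a model of Index Pages: `SortedScalarIndex` is a hypothetical name, and the `time_anchor` component of each record is omitted to keep the sketch short.

```python
from bisect import bisect_left, bisect_right


class SortedScalarIndex:
    """Toy ScalarBucket: (value, sample_id) pairs kept sorted by value."""

    def __init__(self, pairs):
        self._pairs = sorted(pairs)
        self._values = [v for v, _ in self._pairs]

    def range(self, lo, hi):
        """Sample ids whose value v satisfies lo <= v <= hi."""
        start = bisect_left(self._values, lo)   # first position with v >= lo
        end = bisect_right(self._values, hi)    # first position with v > hi
        return [sid for _, sid in self._pairs[start:end]]


idx = SortedScalarIndex([(30, 0), (10, 1), (20, 2), (20, 3), (40, 4)])
hits = idx.range(20, 30)
```

Both bounds are found in O(log N) comparisons, and the matching records are then one contiguous slice, which is the property the sorted bucket layout is meant to buy.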

Three algorithm flavors to ship:

Algorithm ID	Use case	Storage	Lookup cost
`dreamdb.btree-int64`	Integer / timestamp ranges	Sorted (value, sample_id) pairs in leaves	O(log N)
`dreamdb.btree-string`	String categorical / lexical	Sorted (value, sample_id) pairs	O(log N)
`dreamdb.bitmap-categorical`	Low-cardinality categorical (label, split) with very common in-set queries	Roaring bitmap per category value	O(cardinality) for index, O(1) per match

Bitmap is the obvious choice for ImageNet-100's label (100 distinct values, queries like label='cat' resolve to "AND the bitmap for 'cat' with everything else"). B-tree handles wider ranges.
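
A sketch of the bitmap approach, using Python integers as uncompressed bitsets in place of Roaring bitmaps. `BitmapIndex` and `ids` are hypothetical names; a real implementation would use a compressed bitmap library, but the AND-to-intersect logic is the same.

```python
class BitmapIndex:
    """One bitmap per category value; bit i is set when sample i has that value."""

    def __init__(self):
        self.bitmaps = {}  # category value -> int used as a bitset

    def add(self, sample_id: int, value: str) -> None:
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << sample_id)

    def eq(self, value: str) -> int:
        return self.bitmaps.get(value, 0)


def ids(bits: int) -> list:
    """Expand a bitset into a sorted list of sample ids."""
    out, i = [], 0
    while bits:
        if bits & 1:
            out.append(i)
        bits >>= 1
        i += 1
    return out


label, split = BitmapIndex(), BitmapIndex()
rows = [("cat", "train"), ("dog", "train"), ("cat", "val"), ("cat", "train")]
for sample_id, (lab, spl) in enumerate(rows):
    label.add(sample_id, lab)
    split.add(sample_id, spl)

# label == 'cat' AND split == 'train' is a single bitwise AND.
cat_train = ids(label.eq("cat") & split.eq("train"))
```

With 100 label values the index holds 100 bitmaps, and combining two equality clauses never touches the per-sample records.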

Open spec questions to resolve in Phase 2:

How are scalar values written? Inline in the bucket (like vector data) or referenced (like the VS Object pattern)?
Multi-version semantics: when a sample is overwritten, does the scalar index hold both? DreamDB's append-only semantics suggest yes — all versions are queryable, default reader shows latest.
Cardinality threshold for bitmap-vs-btree auto-selection.
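
One possible reading of the multi-version question, sketched under the assumption that the index is an append-only log and the default reader applies latest-write-wins. `AppendOnlyLabels` and the `latest_only` flag are hypothetical; the actual semantics are still an open spec question.

```python
class AppendOnlyLabels:
    """Toy scalar index that keeps every write and filters at read time."""

    def __init__(self):
        self._log = []  # (sample_id, value) in write order; never mutated

    def write(self, sample_id: int, value: str) -> None:
        self._log.append((sample_id, value))

    def lookup(self, value: str, latest_only: bool = True) -> list:
        if not latest_only:
            # Every sample that ever carried this value, across all versions.
            return sorted({sid for sid, v in self._log if v == value})
        latest = {}
        for sid, v in self._log:  # later writes overwrite earlier ones
            latest[sid] = v
        return sorted(sid for sid, v in latest.items() if v == value)


labels = AppendOnlyLabels()
labels.write(0, "cat")
labels.write(1, "dog")
labels.write(0, "dog")  # sample 0 is relabeled; the old entry stays in the log
```

Here `lookup("cat")` returns nothing by default because sample 0 now reads as "dog", while `lookup("cat", latest_only=False)` still finds it.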

This work lands as spec/0011-scalar-indexing.md and corresponding code in dreamdb-protocol/src/scalar_index.rs + new BucketRecord variants. Should follow the same template as our IVF/IMI work — algorithm + tests + spec section + bench validation.

Phase 3: Python bindings

PyO3-based Python module. Mirror the Rust API one-to-one where possible:

python

import numpy as np

import dreamdb

# Open / create
ds = dreamdb.Dataset.open("imagenet-100", backend="file:///data/datasets")

# Append
with ds.append() as s:
    s.image = open("cat.jpg", "rb").read()
    s.label = "cat"
    s.split = "train"
    s.embedding = np.array([0.1, 0.2, ...], dtype=np.float32)

# Filter + iterate (query is assumed to be a float32 vector with the same dimension as the stored embeddings)
for batch in ds.iter(
    filter=dreamdb.f.And([
        dreamdb.f.Where("label", "==", "cat"),
        dreamdb.f.Where("split", "==", "train"),
        dreamdb.f.Vector("embedding", query, top_k=1000),
    ]),
    batch_size=64,
    shuffle=True,
    seed=42,
):
    images = batch["image"]   # list of bytes (decode externally)
    labels = batch["label"]   # list of str
    embed  = batch["embedding"]  # np.ndarray (batch_size, dim)

# PyTorch DataLoader integration
from torch.utils.data import DataLoader
torch_ds = ds.as_iterable_dataset(filter=..., transform=lambda b: ...)
loader = DataLoader(torch_ds, batch_size=64, num_workers=4, ...)

The as_iterable_dataset adapter is what makes DreamDB usable in real training scripts. It implements PyTorch's IterableDataset, with shard-deterministic iteration so that num_workers=N divides the filtered set into N disjoint streams: within an epoch, each sample is delivered to exactly one worker.
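
The sharding rule can be sketched without PyTorch. In this illustration, `worker_stream` is a hypothetical helper, and the way the seed and epoch are mixed into one integer is an arbitrary choice made for the example.

```python
import random


def worker_stream(sample_ids, worker_id, num_workers, epoch, seed):
    """Return the ids one worker should iterate over in one epoch."""
    ordered = sorted(sample_ids)
    # Every worker derives the same permutation from (seed, epoch).
    random.Random(seed * 1_000_003 + epoch).shuffle(ordered)
    # Each worker then keeps every num_workers-th element from its own offset.
    return ordered[worker_id::num_workers]


streams = [worker_stream(range(10), w, 4, epoch=0, seed=42) for w in range(4)]
flat = [sid for stream in streams for sid in stream]
```

Because all workers compute the same permutation and take strided slices of it, the streams are disjoint and together cover the filtered set. Inside a real `IterableDataset.__iter__`, `worker_id` and `num_workers` would come from `torch.utils.data.get_worker_info()`.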

Phase 4: Polish

Multipart upload: large video and audio files can exceed S3's 5 GB single-PUT cap, so the connector layer needs start_multipart, upload_part, and complete_multipart. spec/0005 already lists this as future work.
Random sampling: Filter::RandomSample { count, seed } translates into a stream of sample-id picks via reservoir sampling over the scalar-index leaves.
Stratified sampling: same but bucketed by a categorical field; pulls count / num_strata from each bucket.
Distributed sharding: when num_workers > 1, each worker only iterates over its assigned shard of the filtered sample-id set. Determinism requires the filter-evaluation order to be stable across workers — straightforward if the filter resolves to a sorted sample-id list.
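
The two sampling primitives can be sketched over a plain iterable of sample ids. This uses the classic reservoir algorithm (Algorithm R); the function names are hypothetical, iterating over scalar-index leaves is replaced by an ordinary Python iterable, and giving each stratum its own derived seed is an assumption made for the example.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, count, seed):
    """Algorithm R: uniform sample of up to `count` items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < count:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < count:
                reservoir[j] = item
    return reservoir


def stratified_sample(pairs, count, seed):
    """pairs: iterable of (sample_id, stratum). Draws count // num_strata per stratum."""
    strata = defaultdict(list)
    for sample_id, stratum in pairs:
        strata[stratum].append(sample_id)
    per_stratum = count // len(strata)
    picked = []
    for k, stratum in enumerate(sorted(strata)):  # sorted for a stable order
        picked += reservoir_sample(strata[stratum], per_stratum, seed + k)
    return picked


sample = reservoir_sample(range(1000), 10, seed=42)
pairs = [(i, "cat" if i % 2 == 0 else "dog") for i in range(100)]
balanced = stratified_sample(pairs, 10, seed=1)
```

Reservoir sampling needs only one pass and `count` slots of memory, which suits a scan over index leaves where the total number of matches is not known in advance.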

Risks and open questions

Risk	Notes
Embedding generation is out of scope but every real dataset needs it.	Document the recommended pattern (run a separate `dreamdb-dataset embed --model resnet50` step before upload). Don't bundle the model.
Phase 2's scalar-index is a real spec contribution; could expand to its own multi-week effort.	Start with bitmap-only (the ImageNet-100 case); B-tree comes after.
Python multi-worker DataLoader semantics are subtle (epoch boundaries, shuffling determinism, worker re-seeding).	Crib from HuggingFace `datasets` and PyTorch `IterableDataset` examples. The Rust side just needs to expose enough primitives (sharded iteration with deterministic seeds) for the Python layer to compose them.
PyO3 wraps Rust async into Python sync clumsily.	Likely solution: each Python `iter()` call holds a Tokio runtime internally and `block_on`s. Avoids leaking async into Python.
Versioning UX: do users see `dataset@v1.2` or `dataset@<commit-hash>`?	Default to `dataset@v1.2` (snapshots are user-named labels mapped to Manifest hashes). Hashes always work as a fallback.
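
The versioning default in the last row (labels first, hashes as a fallback) can be sketched as a small resolver. The `resolve` function, the shape of the `labels` mapping, and the placeholder hash strings are all hypothetical.

```python
def resolve(spec: str, labels: dict, manifests: set) -> str:
    """Resolve 'name@version' to a Manifest hash: labels first, raw hash as fallback."""
    name, sep, version = spec.partition("@")
    if not sep or not version:
        raise ValueError(f"expected 'dataset@version', got {spec!r}")
    named = labels.get(name, {})
    if version in named:
        return named[version]   # user-named snapshot label
    if version in manifests:
        return version          # already a known Manifest hash
    raise KeyError(f"unknown version {version!r} for dataset {name!r}")


labels = {"imagenet-100": {"v1.2": "hash-aaa"}}  # placeholder hashes
manifests = {"hash-aaa", "hash-bbb"}
by_label = resolve("imagenet-100@v1.2", labels, manifests)
by_hash = resolve("imagenet-100@hash-bbb", labels, manifests)
```

Checking labels before hashes keeps the friendly form as the default while leaving every Manifest addressable, including ones that were never given a label.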

Decision log

Decision	Choice	Why
Reference dataset	ImageNet-100	Real scale, real metadata, common ML benchmark
Scalar-metadata path	Native (option 1)	Long-arc consistency; keeps the data plane single-system
Python on the critical path?	Yes	PyTorch DataLoader is non-negotiable for ML adoption
Embedding generation in SDK?	No (recommend external step)	Keeps the SDK focused; embedding choices vary by user

Next concrete step

Once the ImageNet-100 download finishes (in flight as of this doc), Phase 1 can begin: scaffold the dreamdb-dataset crate with the API skeleton above, and write a CLI app that ingests the ImageNet-100 train split and queries by vector similarity. No metadata filter yet; that arrives in Phase 2.

Phase 1 is also the moment to decide if Sample joins via a separate "join Track" (sample-id ↔ per-field references) or via shared time anchors across Tracks. The latter is simpler but locks us out of per-sample updates. Worth deciding before writing too much code.