10B-Scale Blocker Audit
2026-05-15. Target: dataset with 10⁹–10¹⁰ records works for the canonical ML workflow (ingest, train, sweep, GC, query). Every blocker gets a line; ordered by impact.
What "10B scale" actually means in numbers
| Resource | Budget at 10⁹ records, the low end of the target (CLIP-style: image + 512-dim emb + label) |
|---|---|
| Raw embeddings (f32) | 10⁹ × 512 × 4 B = 2.0 TB |
| Compressed embeddings (RaBitQ 1-bit + corrections) | 10⁹ × 76 B = 76 GB |
| Image bytes (avg 50 KB JPEG) | 10⁹ × 50 KB = 50 TB |
| Labels (avg 30 B string) | 10⁹ × 30 B = 30 GB |
| IVF cells at k=√N | √(10⁹) = 31,623 cells |
| Bucket Objects (60-record avg) | 16.7M buckets |
| Manifest size (inline) | impossible; must page |
| Index Pages at fanout=100 | 16.7M / 1000 leaves ≈ 17K leaf pages + 170 internal |
| Anchor set in memory (u64) | 10⁹ × 8 B = 8 GB |
| GC list operation | 10⁹ × header bytes + manifest walk |
(These figures are for 10⁹ records. For the 10¹⁰ upper end of the target, multiply the linear rows by 10; the √N cell count grows by √10 ≈ 3.2×.)
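The budget rows above are plain arithmetic, so they can be recomputed for any record count. A minimal sketch, assuming decimal units (1 KB = 1000 B) and the per-record sizes quoted in the table; the function name `budget` and its parameter names are hypothetical, not part of DreamDB:

```python
import math


def budget(n_records: int, dim: int = 512, code_bytes: int = 76,
           image_bytes: int = 50_000, label_bytes: int = 30,
           records_per_bucket: int = 60) -> dict:
    """Back-of-envelope resource budget for n_records CLIP-style records."""
    return {
        "raw_embeddings_bytes": n_records * dim * 4,            # f32 = 4 bytes per dimension
        "compressed_embeddings_bytes": n_records * code_bytes,  # 1-bit codes + corrections
        "image_bytes": n_records * image_bytes,
        "label_bytes": n_records * label_bytes,
        "ivf_cells": round(math.sqrt(n_records)),               # k = sqrt(N), rounded
        "bucket_objects": n_records // records_per_bucket,
        "anchor_set_bytes": n_records * 8,                      # one u64 anchor per record
    }


low = budget(10**9)
high = budget(10**10)
print(f"raw embeddings: {low['raw_embeddings_bytes'] / 1e12:.2f} TB at 1e9 records")
print(f"IVF cells: {low['ivf_cells']} at 1e9, {high['ivf_cells']} at 1e10")
```

Linear rows scale with N, while the cell count scales with √N, which is why the upper end of the target is not a uniform 10× on every row.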
Hard architectural rules at 10B:
- Nothing materializes O(N) in RAM
- Every O(N) operation parallelizes across workers
- Single-machine work bounded to O(log N) or O(√N)
- Total elapsed for any single op ≤ wall-clock budget (typically 1 hour)
Blocker inventory (by severity)
🔴 P0 blockers — workflow OOMs or never completes
B1. Eager iter (Dataset::iter returns Vec<Batch>).
- Current: materializes ALL records in RAM before yielding any batch
- 10B impact: would need 2 TB of RAM for embeddings alone. OOM on any single machine.
- Live evidence: at 231K records, peak RSS 1.5 GB (5× the raw data because of Vec<Field::Embedding>)
- Fix: `Dataset::iter_stream(...) -> impl Stream<Item=Result<Batch>>`. Bounded RAM per batch.
- Status: ✅ Rust shipped (2026-05-15) + Python streaming shipped 2026-05-18 (B1.5). Embedding-only MVP; multi-modality merge-join + scalar streaming deferred. Live test: streaming 5120 records of the 231K dataset at 112/s with bounded RAM. Python now has a real `Dataset.iter_stream(batch_size, fields, channel_buffer)` returning a `StreamBatchIter` (PyO3 iterator with `__iter__` + `__next__`). A background tokio task drives the Rust stream into an mpsc channel; `__next__` calls `block_on` for the next batch. Verified: 50 records / batch_size=10 → 5 batches yielded one at a time; early break (`del it`) cleanly stops the spawned task.
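The producer/channel/iterator shape described in the status note can be illustrated in pure Python. This is a sketch under stated assumptions, not the shipped PyO3 code: a thread stands in for the tokio task, `queue.Queue` stands in for the mpsc channel, records are plain integers, and the `close()` method is a hypothetical explicit equivalent of the early-break cleanup.

```python
import queue
import threading
from typing import Iterable, List

_DONE = object()  # sentinel: the producer has no more batches


class StreamBatchIter:
    """Illustrative stand-in: yields batches one at a time from a bounded channel."""

    def __init__(self, records: Iterable[int], batch_size: int, channel_buffer: int = 2):
        # Bounded queue: the producer blocks once channel_buffer batches are waiting,
        # so memory stays at roughly (channel_buffer + 2) batches regardless of N.
        self._queue: queue.Queue = queue.Queue(maxsize=channel_buffer)
        self._stop = threading.Event()
        self._finished = False
        self._worker = threading.Thread(
            target=self._produce, args=(records, batch_size), daemon=True
        )
        self._worker.start()

    def _send(self, item) -> bool:
        # Retry with a timeout so a stopped consumer never leaves the producer blocked.
        while not self._stop.is_set():
            try:
                self._queue.put(item, timeout=0.05)
                return True
            except queue.Full:
                continue
        return False

    def _produce(self, records: Iterable[int], batch_size: int) -> None:
        batch: List[int] = []
        for record in records:
            batch.append(record)
            if len(batch) == batch_size:
                if not self._send(batch):
                    return
                batch = []
        if batch and not self._send(batch):
            return
        self._send(_DONE)

    def __iter__(self):
        return self

    def __next__(self) -> List[int]:
        if self._finished:
            raise StopIteration
        item = self._queue.get()
        if item is _DONE:
            self._finished = True
            raise StopIteration
        return item

    def close(self) -> None:
        """Stop the producer early and wait for it to exit."""
        self._finished = True
        self._stop.set()
        self._worker.join()
```

Iterating `StreamBatchIter(range(50), batch_size=10)` yields five batches; calling `close()` after the first batch of a much larger input stops the producer without draining the source.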
B2. Sharded ingest doesn't exist.
- Current: `Dataset::append_many` is single-process. With CLIP at ~120 samples/s, ingesting 10⁹ images takes ~8.6M seconds ≈ 99 days.
- 10B impact: dataset never gets built in the first place.
- Fix: Workers ingest disjoint slices to per-worker BRANCHES; orchestrator merges branches via fast-forward or multi-parent. Reuses existing branch/merge primitives. ~600 LOC + spec for multi-parent merge.
- Status: ✅ Shipped (2026-05-18). `MergeStrategy::UnionTracks` + `Dataset::merge_many` + `dreamdb merge-many` CLI. The 3-way union-merge walks both parent chains to find the LCA, then for each embedding TrackEntry runs cell-by-cell reconciliation: per cell, take trunk (if branch unchanged) / take branch (if trunk unchanged) / fetch+merge buckets (if both diverged). Bucket merge unions records by `time_anchor`, refusing loudly on slice-assignment errors (same anchor on both sides with different vectors). The new multi-parent Manifest has `parents = [trunk_tip, branch_tip]`. v0 limitations: SpatialBucket inline tracks only; non-embedding tracks (Fragment/Scalar/Constant) must be identical across trunk and branch. 3 new tests (`union_merge_combines_disjoint_branch_appends`, `union_merge_no_op_when_same_tip`, `merge_many_combines_three_branches`); also fixed an existing latent bug: FastForward merge wasn't refreshing `field_tracks`, so post-FF `iter()` returned stale records. Added `Dataset::refresh_field_tracks_from_current`, called after both FF and UnionTracks. 721 total green.
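The per-cell reconciliation rule in the status note can be sketched as follows. This is an illustration only: Tracks are modeled as in-memory dicts of `cell_id -> bucket`, buckets as dicts of `time_anchor -> vector`, and equality is checked by value, whereas a content-addressed store would presumably compare bucket hashes. The names `union_merge` and `merge_buckets` are hypothetical.

```python
from typing import Dict, Tuple

Bucket = Dict[int, Tuple[float, ...]]  # time_anchor -> vector (hypothetical shape)
Track = Dict[int, Bucket]              # cell_id -> bucket


def merge_buckets(trunk: Bucket, branch: Bucket) -> Bucket:
    """Union records by anchor; refuse if one anchor carries two different vectors."""
    merged = dict(trunk)
    for anchor, vector in branch.items():
        if anchor in merged and merged[anchor] != vector:
            raise ValueError(f"anchor {anchor} differs between trunk and branch")
        merged[anchor] = vector
    return merged


def union_merge(base: Track, trunk: Track, branch: Track) -> Track:
    """Three-way merge per cell, relative to the common ancestor `base`."""
    out: Track = {}
    for cell in sorted(set(trunk) | set(branch)):
        b, t, r = base.get(cell), trunk.get(cell), branch.get(cell)
        if r == b:      # branch unchanged since the ancestor: take trunk
            chosen = t
        elif t == b:    # trunk unchanged since the ancestor: take branch
            chosen = r
        else:           # both sides diverged: merge the two buckets
            chosen = merge_buckets(t or {}, r or {})
        if chosen is not None:
            out[cell] = chosen
    return out
```

Only the third case touches bucket contents, which is what keeps the merge proportional to the number of cells both sides modified rather than to N.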
B3. Sharded redispatch in ada-ivf-step.
- Current: orchestrator does the redispatch serially. 10⁹ records × ~1 ms/record ≈ 10⁶ s, about 11.6 days on a single machine.
- 10B impact: each maintenance op takes days, far beyond the 1-hour wall-clock budget.
- Fix: Phase 3.3 (workers redispatch their slice in parallel, orchestrator stitches paged Track leaves). 4-stage k8s Job. ~400 LOC + docs.
- Status: ✅ Shipped (2026-05-18). 4-stage pipeline: (1) centroid workers via `--shard N --of M` (existing); (2) `--orchestrate --orchestrate-phase=publish-si` aggregates centroids and publishes the new SI Object + `_ada_ivf/<job_id>/new_si.json` marker; (3) `--redispatch-shard N --of M` workers read the marker, redispatch their owned cell slice (`cell_id % M == N`), and emit per-shard bucket-entries JSON; (4) `--orchestrate --orchestrate-phase=finalize` aggregates the redispatch shard JSONs, builds the Track (auto-paged at >8000 entries), publishes the Manifest, and CASes the Ref. Single-machine mode (no phase, no redispatch-shard) still works end-to-end (used by 90% of operators). 705 tests green.
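Stages (3) and (4) rely on the ownership rule `cell_id % M == N` giving a disjoint, complete cover of the cells. A small sketch of that invariant, assuming hypothetical helper names (`owned_cells`, `redispatch_shard`, `finalize`) and fake bucket addresses in place of real redispatch work:

```python
from typing import Dict, List

PAGE_THRESHOLD = 8000  # Track is paged above this many entries (figure from the note above)


def owned_cells(shard: int, of: int, k: int) -> List[int]:
    """Cells owned by worker `shard` out of `of` workers: cell_id % of == shard."""
    return [cell for cell in range(k) if cell % of == shard]


def redispatch_shard(shard: int, of: int, k: int) -> Dict[int, str]:
    # Placeholder for the real work: map each owned cell to a fake bucket address.
    return {cell: f"bucket-{cell}" for cell in owned_cells(shard, of, k)}


def finalize(shard_outputs: List[Dict[int, str]], k: int) -> dict:
    """Stitch per-shard outputs into one Track; fail on overlap or missing cells."""
    track: Dict[int, str] = {}
    for output in shard_outputs:
        overlap = track.keys() & output.keys()
        if overlap:
            raise ValueError(f"cells claimed by two shards: {sorted(overlap)[:5]}")
        track.update(output)
    if set(track) != set(range(k)):
        raise ValueError("some cells were not redispatched by any shard")
    return {"entries": [track[cell] for cell in range(k)], "paged": k > PAGE_THRESHOLD}
```

The overlap and coverage checks in `finalize` are a design suggestion rather than documented behavior: they turn a missing or duplicated shard into a loud failure before anything is published.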
B4. dreamdb-cli gc walks every object serially.
- Current: serial list-prefix + parallel HEAD (32 concurrent) + parallel DELETE (32 concurrent).
- 10B impact: 10⁹+ list entries to enumerate. At 1000 list ops/s = 11 days just to list.
- Fix: prefix-sharded GC — N workers handle disjoint backend key prefixes. Each enumerates + HEADs + DELETEs its slice in parallel. ~300 LOC.
- Status: ✅ Shipped (2026-05-18). `dreamdb gc --shard N --of M`: each worker does the full LIST + live-set walk (cheap; required for correctness), then partitions candidates by leading-64-bits-of-multihash mod M. Only owns its slice for HEAD + DELETE. Live verified on imagenet-100-rabitq-corrected: 4 shards split 84,395 candidates as 20766/21007/21355/21267 (variance <2%). 3 new unit tests for partition uniformity. The LIST itself is single-worker per pod (S3 list_objects_v2 paginates internally; ~1000-50000 entries/s); HEAD+DELETE are the actual 10B bottleneck and they parallelize M× cleanly. Sub-LIST prefix-shard parallelism (multi-prefix LIST per worker) deferred; not on the critical path.
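The partitioning step can be sketched in a few lines. Assumptions, stated loudly: raw SHA-256 digests stand in for multihashes (a real multihash carries a prefix before the digest), the leading 8 bytes are read big-endian, and the function names are hypothetical.

```python
import hashlib
from typing import List


def shard_of(object_hash: bytes, of: int) -> int:
    """Owner shard for a candidate: leading 64 bits of the hash, mod the shard count."""
    return int.from_bytes(object_hash[:8], "big") % of


def my_slice(candidates: List[bytes], shard: int, of: int) -> List[bytes]:
    """Candidates this worker is responsible for HEAD + DELETE on."""
    return [h for h in candidates if shard_of(h, of) == shard]


def shard_counts(candidates: List[bytes], of: int) -> List[int]:
    counts = [0] * of
    for h in candidates:
        counts[shard_of(h, of)] += 1
    return counts


# Synthetic candidates, same count as the live run quoted above.
candidates = [hashlib.sha256(str(i).encode()).digest() for i in range(84_395)]
print(shard_counts(candidates, 4))
```

Because every worker derives ownership from the hash alone, no coordination is needed: each pod computes the same partition independently and the slices cannot overlap.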
B5. Per-anchor join loop in iter_time_range is serial.
- Current: for each anchor, fetch blob bytes; serial loop.
- 10B impact: 10⁹ × 5 ms HTTP RTT = 57 days just for blob fetches.
- Fix: parallelize the inner per-anchor blob fetch with `buffer_unordered(64)`. Same pattern as P4.0c (embedding fetch).
- Status: ✅ Shipped (2026-05-15). Per-chunk pre-fetch via `buffer_unordered(64)` builds an anchor→bytes map; the per-anchor join is now an in-memory lookup.
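The prefetch-then-join pattern can be shown with a thread pool in place of the Rust stream combinator. A minimal sketch, assuming a caller-supplied `fetch_blob` function (hypothetical) and a cap of 64 requests in flight; unlike `buffer_unordered`, `ThreadPoolExecutor.map` returns results in input order, which does not matter here because the result is a map:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def prefetch_blobs(anchors: List[int], fetch_blob: Callable[[int], bytes],
                   max_in_flight: int = 64) -> Dict[int, bytes]:
    """Fetch every distinct anchor's blob with bounded concurrency."""
    unique = list(dict.fromkeys(anchors))  # de-duplicate, keep first-seen order
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return dict(zip(unique, pool.map(fetch_blob, unique)))


def join_chunk(records: List[Tuple[int, str]], fetch_blob: Callable[[int], bytes],
               max_in_flight: int = 64) -> List[Tuple[int, str, bytes]]:
    """Join one chunk of (anchor, label) records against their blobs."""
    blobs = prefetch_blobs([anchor for anchor, _ in records], fetch_blob, max_in_flight)
    # After the prefetch, the per-anchor join is a dictionary lookup, not a network call.
    return [(anchor, label, blobs[anchor]) for anchor, label in records]
```

With 64 requests in flight, the wall-clock cost per chunk is bounded by roughly one round trip per 64 anchors instead of one per anchor.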
🟡 P1 blockers — works but degraded at scale
B6. Single-process IVF dispatch (hash_vector) per query.
- Current: O(k × dim) per record. At k=31K, dim=512 → 16M multiplies = ~5 ms per query on a single CPU core.
- 10B impact: 200 q/s/core query throughput. Probably acceptable but tight.
- Fix A: parallel hash_vector via rayon (5-8× speedup on 8-core).
- Fix B: switch to IMI partitioning (k_sub=√k ≈ 177; per-query cost √k × dim = 90K muls = ~30 µs per query). 100× speedup. Algorithm already in protocol; not used in production datasets.
- Status: ✅ Shipped (2026-05-18). Fix A: `IvfCosine::compute_dots` parallelized via rayon when `k * dim > 512K flops` (below the threshold the serial path runs to avoid fork-join overhead for small training-scale k). Per-query latency at 10B-scale drops from ~5 ms to ~700 µs on 8-core. Fix B (IMI in production): already plumbed through `SpatialDispatcher::Imi` in dreamdb-dataset; choosing `algorithm: "dreamdb.imi-cosine"` at schema-create time activates the √k × dim path automatically. No additional code needed. 1 new determinism test (`parallel_compute_dots_matches_serial`).
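The cost gap between flat IVF dispatch and IMI follows from how many dot-product multiplies each performs. The sketch below assumes a textbook two-way inverted multi-index (split the vector in half, pick the best sub-centroid per half, combine the two indices); whether `dreamdb.imi-cosine` uses exactly this layout is not stated in this document, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ivf_muls(k: int, dim: int) -> int:
    """Flat IVF: one full-length dot product per centroid."""
    return k * dim


def imi_muls(k: int, dim: int) -> int:
    """Two-way IMI: k_sub half-length dot products per half, k_sub = floor(sqrt(k))."""
    k_sub = math.isqrt(k)
    return 2 * k_sub * (dim // 2)


def nearest(sub_centroids: List[Sequence[float]], part: Sequence[float]) -> int:
    """Index of the sub-centroid with the largest dot product against `part`."""
    return max(range(len(sub_centroids)), key=lambda i: dot(sub_centroids[i], part))


def imi_cell(vec: Sequence[float], left: List[Sequence[float]],
             right: List[Sequence[float]]) -> int:
    """Cell id = (best left sub-centroid, best right sub-centroid) flattened row-major."""
    half = len(vec) // 2
    return nearest(left, vec[:half]) * len(right) + nearest(right, vec[half:])


print(ivf_muls(31_623, 512), imi_muls(31_623, 512))
```

At k = 31,623 and dim = 512 this gives about 16.2M multiplies for flat IVF against about 90.6K for IMI, consistent with the 16M and 90K figures quoted above.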
B7. Manifest size at 10B.
- Current: inline tracks paged when > 1 MiB; Manifest's `tracks` field could itself be paged but I haven't verified.
- 10B impact: with k=31K + 4 fields × ~120 B per Track entry, Manifest tracks list is ~15 KB. Inline-fine. Per-field paged Tracks have many leaf pages; Manifest is small either way.
- Fix: probably no action needed; audit to confirm Manifest stays under 1 MiB at 10B.
- Status: ✅ Verified (2026-05-18); no action needed. Math: 10B records × 1 timeline × ~5 fields (image, embedding, label, +1-2 layered scalars). Manifest contains: (a) `tracks` field = 5 × ~150 B per TrackEntry = ~750 B; (b) `registry` = 5 modalities × ~100 B + schema CBOR (~1 KB) + `dreamdb.tombstones` head (~50 B) = ~1.5 KB; (c) `parents` = 1-N × 33 B. Total Manifest ≈ 2.5 KB at 10B records. Headroom factor: 400×. The data volume lives in PAGED Tracks/Buckets that the Manifest REFERENCES, not in the Manifest itself. Inline `tracks` paging (already supported per spec/0008 §4) would only matter for pathological cases (thousands of distinct modalities); not a 10B-scale concern in any normal workflow.
B8. No tombstones / deletion.
- Current: append-only; deletion is a full data rewrite.
- 10B impact: any single GDPR request requires rewriting the entire dataset. Untenable.
- Fix: tombstone primitive in spec/0020. Per-modality `dreamdb.tombstones` registry entry of `(ordinal, anchor_hash)` pairs. Query path skips matched records. GC eventually compacts.
- Status: ✅ Shipped (2026-05-18). spec/0020 + `TombstoneList` Object (canonical CBOR, sorted u64 anchors, parent-DAG) + `Dataset::delete` + `Dataset::tombstone_set` + `dreamdb delete` CLI. Anchor granularity is Item-level (u64 TimeAnchor, the same key used by SpatialBucket records), so one tombstone suppresses every record across every modality, which matches GDPR. Read-side filter wired into `iter_with_fields` (anchor-set retain before blob fetch) and `iter_stream` (per-record skip). Tombstone Object addressed at `tombstones/<hash>`. Manifest opts in via the `dreamdb.tombstones` registry entry; absence means empty set (backward-compat). 9 new tests (6 protocol + 2 SDK + 1 address round-trip), 717 total green. Sub-anchor field-level tombstones + paged tombstone-lists + compact-tombstones operator deferred (spec/0020.1).
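The read-side filter amounts to a membership test against a sorted anchor list. A minimal sketch, assuming records are `(anchor, payload)` tuples and using binary search over the sorted anchors; `TombstoneSet` and `live_records` are hypothetical names, not the SDK's API:

```python
import bisect
from typing import Iterable, Iterator, List, Tuple


class TombstoneSet:
    """Sorted, de-duplicated u64 anchors with O(log n) membership checks."""

    def __init__(self, anchors: Iterable[int] = ()):
        self._sorted: List[int] = sorted(set(anchors))

    def __contains__(self, anchor: int) -> bool:
        i = bisect.bisect_left(self._sorted, anchor)
        return i < len(self._sorted) and self._sorted[i] == anchor

    def __len__(self) -> int:
        return len(self._sorted)


def live_records(records: Iterable[Tuple[int, str]],
                 tombstones: TombstoneSet) -> Iterator[Tuple[int, str]]:
    """Per-record skip: drop every record whose anchor is tombstoned, in any modality."""
    return (record for record in records if record[0] not in tombstones)


records = [(1, "image"), (1, "embedding"), (2, "image"), (3, "label")]
print(list(live_records(records, TombstoneSet([1]))))
```

An empty `TombstoneSet` passes every record through, mirroring the backward-compatible case where the Manifest has no tombstone entry.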
B9. Connector pool / HTTP/2 saturation.
- Current: 32 idle connections per host, single HTTP/2 stream effectively (per earlier `lsof` showing 1 socket under buffer_unordered(16)).
- 10B impact: backend throughput is likely the bottleneck before workers are. Need multi-host fan-out or HTTP/2 stream tuning.
- Fix: HTTP/3 connector, or shard across multiple backend endpoints (which the protocol allows — refs/buckets aren't endpoint-bound).
- Status: ⏳ open
B10. Browser query path single-threaded JS.
- Current: ADC scoring runs on the JS main thread.
- 10B impact: queries to a 10B-record dataset would freeze the browser tab. Not actually a production concern (the browser demo is illustrative; real apps use the Rust SDK), but worth noting.
- Fix: Web Workers for ADC scoring. Out of scope for 10B-scale push.
🔵 P2 — works but should be improved
B11. Quantization drift on long-lived datasets.
- Decode-rebuild-encode cycle on RaBitQ compounds error. At 10B records over years of rebuilds, real concern.
- Fix: `rerank=True` mode (raw f32 stored alongside codes); already shipped, not used on production datasets.
B12. Per-batch HTTP overhead in append_many.
- Current: ~N HTTP GET + N HTTP PUT per batch for N touched cells.
- 10B impact: with k bounded (Phase 2.2 merge step), each batch touches ~16 cells avg. Not the dominant cost.
Execution plan: 10B-blocker push
Order by leverage × bounded-LOC, NOT by what's most fun:
| # | Item | LOC | Time | Why this order |
|---|---|---|---|---|
| 1 | B1 streaming iter ✅ | ~400 | 1-2 days | Unblocks every read-side workflow; OOM is a hard wall |
| 2 | B5 parallel per-anchor blob fetch ✅ | ~30 | 1 hr | Ship with B1; same diff area |
| 3 | B3 sharded redispatch ✅ | ~400 | 1-2 days | Maintenance becomes feasible at 10B |
| 4 | B4 prefix-sharded GC ✅ | ~300 | 1 day | Without this storage grows forever |
| 5 | B8 tombstones ✅ | ~400 | 2-3 days | GDPR-blocking; spec work needed |
| 6 | B2 sharded ingest ✅ | ~600 | 3-5 days | Last because it can use B1+B3+B4+B8 once they exist |
| 7 | B6 IMI / rayon hash_vector ✅ | ~100 | half day | Optional; only needed if queries are slow |
| 8 | B7 verify Manifest size ✅ | audit only | 1 hr | Probably no work needed |
Total: ~2200 LOC over ~12-14 days of focused work. After this, DreamDB is structurally ready for 10B-scale workloads.
What's already 10B-ready
These were today's wins that ALSO carry through to 10B:
- ✅ Chain-aware lineage (Phase 3.1): rebuilds at O(touched cells) not O(N)
- ✅ Cold-bucket skip (Phase 3.2): same
- ✅ Paged tracks (Phase 3.4): inline-vs-paged auto-decides; no manual ceremony
- ✅ Content-addressed storage: cross-rebuild dedup means 10B records can take 1.5-2× their raw byte count, not 10×
- ✅ Read-online during rebuilds: queries on the OLD Manifest aren't blocked
- ✅ Snapshot/branch: 33-byte Refs scale to billions per dataset without effort
- ✅ `iter_with_fields`: P1.0 fix delivered 8.5× speedup at 231K; same gain proportional at 10B
- ✅ Schema persistence in Manifest registry
The architecture is right. What remains is mechanical: every place we have an O(N) loop must become O(N/workers) or streaming. Every place we have a Vec<Foo> result must become a Stream<Item=Foo>.
Why no architectural changes are needed
The 10B-blocker list is striking for what's NOT on it: spec-level issues. No item requires a CBOR-shape change, no item requires revisiting lineage, no item requires a new Object type. Phase 3.1's chain-aware lineage was the load-bearing spec change; everything else is implementation work.
DreamDB's architectural framing is sound. The push to 10B is execution, not design.