10B-Scale Blocker Audit

2026-05-15. Target: a dataset with 10⁹–10¹⁰ records works for the canonical ML workflow (ingest, train, sweep, GC, query). Every blocker gets an entry, ordered by impact.

What "10B scale" actually means in numbers

Resource	Budget at 10⁹ records, the low end of the target (CLIP-style: image + 512-dim emb + label)
Raw embeddings (f32)	10⁹ × 512 × 4 = 2.0 TB
Compressed embeddings (RaBitQ 1-bit + corrections)	10⁹ × 76 = 76 GB
Image bytes (avg 50 KB JPEG)	10⁹ × 50 KB = 50 TB
Labels (avg 30 B string)	10⁹ × 30 = 30 GB
IVF cells at k=√N	√(10⁹) = 31,623 cells
Bucket Objects (60-record avg)	16.7M buckets
Manifest size (inline)	impossible — must page
Index Pages (1000 entries per leaf, internal fanout=100)	16.7M / 1000 ≈ 17K leaf pages + ~170 internal
Anchor set in memory (u64)	10⁹ × 8 = 8 GB
GC list operation	10⁹ × header bytes + manifest walk

(For the top of the target range, 10¹⁰ records, multiply every row by 10.)
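The rows above are plain multiplication, so they are easy to recompute for any record count. A minimal sketch, assuming the per-record sizes quoted in the table; the `Budget` struct and `budget` function are hypothetical names for illustration, not part of DreamDB:

```rust
/// Back-of-envelope storage budget for `n` records.
/// Every constant is a per-record figure quoted in the table above.
#[derive(Debug, PartialEq)]
pub struct Budget {
    pub raw_embedding_bytes: u64,
    pub compressed_embedding_bytes: u64,
    pub image_bytes: u64,
    pub label_bytes: u64,
    pub ivf_cells: u64,
    pub buckets: u64,
    pub anchor_set_bytes: u64,
}

pub fn budget(n: u64) -> Budget {
    const DIM: u64 = 512; // embedding dimensions
    const F32_BYTES: u64 = 4;
    const RABITQ_BYTES: u64 = 76; // 1-bit codes + corrections
    const IMAGE_BYTES: u64 = 50_000; // avg 50 KB JPEG
    const LABEL_BYTES: u64 = 30;
    const RECORDS_PER_BUCKET: u64 = 60;
    const ANCHOR_BYTES: u64 = 8; // u64 anchor
    Budget {
        raw_embedding_bytes: n * DIM * F32_BYTES,
        compressed_embedding_bytes: n * RABITQ_BYTES,
        image_bytes: n * IMAGE_BYTES,
        label_bytes: n * LABEL_BYTES,
        ivf_cells: (n as f64).sqrt().round() as u64, // k = sqrt(N)
        buckets: n / RECORDS_PER_BUCKET,
        anchor_set_bytes: n * ANCHOR_BYTES,
    }
}
```

Calling `budget(1_000_000_000)` reproduces the table; `budget(10_000_000_000)` gives the top of the target range.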

Hard architectural rules at 10B:

Nothing materializes O(N) in RAM
Every O(N) operation parallelizes across workers
Single-machine work bounded to O(log N) or O(√N)
Total elapsed for any single op ≤ wall-clock budget (typically 1 hour)

Blocker inventory (by severity)

🔴 P0 blockers — workflow OOMs or never completes

B1. Eager iter (Dataset::iter returns Vec<Batch>).

Current: materializes ALL records in RAM before yielding any batch
10B impact: would need 2 TB of RAM for embeddings alone. OOM on any single machine.
Live evidence: at 231K records, peak RSS 1.5 GB (5× the raw data because of Vec<Field::Embedding>)
Fix: Dataset::iter_stream(...) -> impl Stream<Item=Result<Batch>>. Bounded RAM per batch.
Status: ✅ Rust shipped (2026-05-15) + Python streaming shipped 2026-05-18 (B1.5). Embedding-only MVP; multi-modality merge-join + scalar streaming deferred. Live test: streaming 5120 records of 231K dataset at 112/s with bounded RAM. Python now has real Dataset.iter_stream(batch_size, fields, channel_buffer) returning a StreamBatchIter (PyO3 iterator with __iter__ + __next__). Background tokio task drives the Rust stream into an mpsc channel; __next__ block_on's the next batch. Verified: 50 records / batch_size=10 → 5 batches yielded one-at-a-time; early-break (del it) cleanly stops the spawned task.
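The bounded-channel pattern described in the status can be sketched with the standard library alone. This is an illustration under stated assumptions, not the shipped code: a std thread and `sync_channel` stand in for the tokio task and mpsc channel, records are bare `u64` values, and `iter_stream_sketch` is a hypothetical name.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Stand-in for a real Batch; here just a vector of record ids.
pub type Batch = Vec<u64>;

/// Produce `total` records in batches of `batch_size`, holding at most
/// `channel_buffer` finished batches in memory at any time.
pub fn iter_stream_sketch(total: u64, batch_size: usize, channel_buffer: usize) -> Receiver<Batch> {
    assert!(batch_size > 0, "batch_size must be positive");
    let (tx, rx) = sync_channel::<Batch>(channel_buffer);
    thread::spawn(move || {
        let mut batch: Batch = Vec::with_capacity(batch_size);
        for record in 0..total {
            batch.push(record);
            if batch.len() == batch_size {
                let full = std::mem::replace(&mut batch, Vec::with_capacity(batch_size));
                // send blocks while the channel is full, which bounds RAM;
                // it errors once the consumer drops the receiver.
                if tx.send(full).is_err() {
                    return; // early break on the consumer side stops the producer
                }
            }
        }
        if !batch.is_empty() {
            let _ = tx.send(batch); // trailing partial batch
        }
    });
    rx
}
```

Iterating the returned receiver yields one batch at a time; dropping it mid-stream ends the producer on its next send, which is the same early-break behaviour the status line reports for `del it`.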

B2. Sharded ingest doesn't exist.

Current: Dataset::append_many is single-process. At ~120 samples/s (CLIP), ingesting 10⁹ images takes ~8.3M seconds ≈ 96 days.
10B impact: dataset never gets built in the first place.
Fix: Workers ingest disjoint slices to per-worker BRANCHES; orchestrator merges branches via fast-forward or multi-parent. Reuses existing branch/merge primitives. ~600 LOC + spec for multi-parent merge.
Status: ✅ Shipped (2026-05-18). MergeStrategy::UnionTracks + Dataset::merge_many + dreamdb merge-many CLI. The 3-way union-merge walks both parent chains to find LCA, then for each embedding TrackEntry runs cell-by-cell reconciliation: per cell, take trunk (if branch unchanged) / take branch (if trunk unchanged) / fetch+merge buckets (if both diverged). Bucket merge unions records by time_anchor, refusing loudly on slice-assignment errors (same anchor on both sides with different vectors). New multi-parent Manifest has parents = [trunk_tip, branch_tip]. v0 limitations: SpatialBucket inline tracks only; non-embedding tracks (Fragment/Scalar/Constant) must be identical across trunk and branch. 3 new tests (union_merge_combines_disjoint_branch_appends, union_merge_no_op_when_same_tip, merge_many_combines_three_branches); also fixed an existing latent bug: FastForward merge wasn't refreshing field_tracks, so post-FF iter() returned stale records. Added Dataset::refresh_field_tracks_from_current called after both FF and UnionTracks. 721 total green.
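The per-cell rule in the status (take trunk, take branch, or union the buckets and refuse on conflict) can be sketched as follows. This is a simplified model under assumptions: a bucket is an in-memory map from time_anchor to vector, equality on `f32` vectors stands in for whatever comparison the real code uses, and `Bucket`, `MergeError`, `union_buckets` and `reconcile_cell` are hypothetical names.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

/// Simplified bucket: time_anchor -> embedding vector.
pub type Bucket = BTreeMap<u64, Vec<f32>>;

#[derive(Debug, PartialEq)]
pub enum MergeError {
    /// Same anchor on both sides with different vectors (slice-assignment error).
    ConflictingAnchor(u64),
}

/// Union two diverged buckets by anchor, refusing on conflicting records.
pub fn union_buckets(trunk: &Bucket, branch: &Bucket) -> Result<Bucket, MergeError> {
    let mut merged = trunk.clone();
    for (anchor, vector) in branch {
        match merged.entry(*anchor) {
            Entry::Vacant(slot) => {
                slot.insert(vector.clone());
            }
            Entry::Occupied(slot) => {
                if slot.get() != vector {
                    return Err(MergeError::ConflictingAnchor(*anchor));
                }
                // identical record on both sides: keep one copy
            }
        }
    }
    Ok(merged)
}

/// Three-way reconciliation of one cell against the common ancestor `base`.
pub fn reconcile_cell(base: &Bucket, trunk: &Bucket, branch: &Bucket) -> Result<Bucket, MergeError> {
    if branch == base {
        Ok(trunk.clone()) // branch unchanged: take trunk
    } else if trunk == base {
        Ok(branch.clone()) // trunk unchanged: take branch
    } else {
        union_buckets(trunk, branch) // both diverged: fetch + merge
    }
}
```

Workers that ingest disjoint slices never hit the error path; it only fires when two workers were handed the same anchor and produced different vectors.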

B3. Sharded redispatch in ada-ivf-step.

Current: orchestrator does the redispatch serially. 10⁹ records × ~1 ms/record = 10⁶ s ≈ 11.6 days single-machine.
10B impact: each maintenance op takes over a week.
Fix: Phase 3.3 (workers redispatch their slice in parallel, orchestrator stitches paged Track leaves). 4-stage k8s Job. ~400 LOC + docs.
Status: ✅ Shipped (2026-05-18). 4-stage pipeline: (1) centroid workers via --shard N --of M (existing); (2) --orchestrate --orchestrate-phase=publish-si aggregates centroids and publishes new SI Object + _ada_ivf/<job_id>/new_si.json marker; (3) --redispatch-shard N --of M workers read the marker, redispatch their owned cell slice (cell_id % M == N), emit per-shard bucket-entries JSON; (4) --orchestrate --orchestrate-phase=finalize aggregates redispatch shard JSONs, builds Track (auto-paged at >8000 entries), publishes Manifest, CAS Ref. Single-machine mode (no phase, no redispatch-shard) still works end-to-end (used by 90% of operators). 705 tests green.
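The ownership rule in stage 3 (`cell_id % M == N`) and the stitch in stage 4 can be modelled in a few lines. A minimal sketch, assuming each shard emits `(cell_id, bucket_ref)` pairs; `owned_cells` and `stitch` are hypothetical helpers, and the real pipeline exchanges per-shard JSON files rather than in-memory vectors.

```rust
/// Cell ids owned by worker `shard` of `of`, using the `cell_id % M == N` rule.
pub fn owned_cells(num_cells: u32, shard: u32, of: u32) -> Vec<u32> {
    assert!(of > 0 && shard < of, "shard must be in 0..of");
    (0..num_cells).filter(|cell_id| cell_id % of == shard).collect()
}

/// Stitch per-shard (cell_id, bucket_ref) outputs into one list ordered by cell id.
/// Returns None if two shards emitted the same cell, which would mean
/// the ownership rule was violated.
pub fn stitch(shard_outputs: Vec<Vec<(u32, String)>>) -> Option<Vec<(u32, String)>> {
    let mut entries: Vec<(u32, String)> = shard_outputs.into_iter().flatten().collect();
    entries.sort_by_key(|entry| entry.0);
    let has_duplicate = entries.windows(2).any(|pair| pair[0].0 == pair[1].0);
    if has_duplicate {
        None
    } else {
        Some(entries)
    }
}
```

Because ownership is a pure function of the cell id and M, shards need no coordination beyond agreeing on M and reading the same new_si.json marker.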

B4. dreamdb-cli gc walks every object serially.

Current: serial list-prefix + parallel HEAD (32 concurrent) + parallel DELETE (32 concurrent).
10B impact: 10⁹+ list entries to enumerate. At 1000 entries/s, that is 10⁶ s ≈ 11.6 days just to list.
Fix: prefix-sharded GC — N workers handle disjoint backend key prefixes. Each enumerates + HEADs + DELETEs its slice in parallel. ~300 LOC.
Status: ✅ Shipped (2026-05-18). dreamdb gc --shard N --of M: each worker does the full LIST + live-set walk (cheap; required for correctness), then partitions candidates by leading-64-bits-of-multihash mod M. Each worker owns only its slice for HEAD + DELETE. Live verified on imagenet-100-rabitq-corrected: 4 shards split 84,395 candidates as 20766/21007/21355/21267 (every shard within 2% of the mean). 3 new unit tests for partition uniformity. The LIST itself is single-worker per pod (S3 list_objects_v2 paginates internally; ~1000-50000 entries/s); HEAD+DELETE are the actual 10B bottleneck and they parallelize M× cleanly. Sub-LIST prefix-shard parallelism (multi-prefix LIST per worker) deferred; not on the critical path.
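The partition rule can be sketched as below. Assumptions are labeled loudly: the leading 64 bits are read big-endian from the start of whatever digest bytes are passed in (the source does not say whether the multihash prefix is included or which byte order is used), and `owner_shard` and `my_slice` are hypothetical names.

```rust
/// Shard that owns a candidate: leading 64 bits of the hash, mod M.
/// Byte order (big-endian) is an assumption made for this sketch.
pub fn owner_shard(digest: &[u8], of: u64) -> u64 {
    assert!(of > 0, "shard count must be positive");
    assert!(digest.len() >= 8, "need at least 64 bits of hash");
    let mut head = [0u8; 8];
    head.copy_from_slice(&digest[..8]);
    u64::from_be_bytes(head) % of
}

/// The subset of GC candidates this worker may HEAD + DELETE.
pub fn my_slice<'a>(candidates: &'a [Vec<u8>], shard: u64, of: u64) -> Vec<&'a [u8]> {
    candidates
        .iter()
        .map(|digest| digest.as_slice())
        .filter(|digest| owner_shard(digest, of) == shard)
        .collect()
}
```

Since every worker computes the same candidate list and the same pure function of the hash, the slices are disjoint and cover every candidate without any cross-worker communication.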

B5. Per-anchor join loop in iter_time_range is serial.

Current: for each anchor, fetch blob bytes; serial loop.
10B impact: 10⁹ × 5 ms HTTP RTT = 5×10⁶ s ≈ 58 days just for blob fetches.
Fix: parallelize the inner per-anchor blob fetch with buffer_unordered(64). Same pattern as P4.0c (embedding fetch).
Status: ✅ Shipped (2026-05-15). Per-chunk pre-fetch via buffer_unordered(64) builds an anchor→bytes map; per-anchor join is now in-memory lookup.
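The prefetch-then-lookup shape can be illustrated without an async runtime. In this sketch scoped std threads stand in for `buffer_unordered(64)`, so it runs fetches in fixed waves rather than refilling a slot as soon as one fetch completes; the `fetch` closure and the `prefetch_chunk` name are hypothetical.

```rust
use std::collections::HashMap;
use std::thread;

/// Fetch the blobs for one chunk of anchors with at most `concurrency`
/// fetches in flight, and return an anchor -> bytes map for the join.
pub fn prefetch_chunk<F>(anchors: &[u64], concurrency: usize, fetch: F) -> HashMap<u64, Vec<u8>>
where
    F: Fn(u64) -> Vec<u8> + Sync,
{
    assert!(concurrency > 0, "concurrency must be positive");
    let mut blobs = HashMap::with_capacity(anchors.len());
    for wave in anchors.chunks(concurrency) {
        // One scoped thread per in-flight fetch; all are joined before the next wave.
        let fetched: Vec<(u64, Vec<u8>)> = thread::scope(|scope| {
            let handles: Vec<_> = wave
                .iter()
                .map(|&anchor| {
                    let fetch = &fetch;
                    scope.spawn(move || (anchor, fetch(anchor)))
                })
                .collect();
            handles
                .into_iter()
                .map(|handle| handle.join().expect("fetch thread panicked"))
                .collect()
        });
        blobs.extend(fetched);
    }
    blobs
}
```

After the map is built, the per-anchor join is a `blobs.get(&anchor)` lookup, so round-trip latency is paid roughly once per wave instead of once per anchor.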

🟡 P1 blockers — works but degraded at scale

B6. Single-process IVF dispatch (hash_vector) per query.

Current: O(k × dim) per record. At k=31K, dim=512 → 16M multiplies = ~5 ms per query on a single CPU core.
10B impact: 200 q/s/core query throughput. Probably acceptable but tight.
Fix A: parallel hash_vector via rayon (5-8× speedup on 8-core).
Fix B: switch to IMI partitioning (k_sub=√k ≈ 178; per-query cost √k × dim ≈ 91K muls = ~30 µs per query). More than 100× speedup. Algorithm already in protocol; not used in production datasets.
Status: ✅ Shipped (2026-05-18). Fix A: IvfCosine::compute_dots parallelized via rayon when k * dim > 512K flops (below threshold the serial path runs to avoid fork-join overhead for small training-scale k). Per-query latency at 10B-scale drops from ~5 ms to ~700 µs on 8-core. Fix B (IMI in production): already plumbed through SpatialDispatcher::Imi in dreamdb-dataset; choosing algorithm: "dreamdb.imi-cosine" at schema-create time activates the √k × dim path automatically. No additional code needed. 1 new determinism test (parallel_compute_dots_matches_serial).
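The threshold-gated parallel path of Fix A can be sketched as follows. Assumptions: scoped std threads stand in for rayon, centroids are stored as `Vec<Vec<f32>>`, and the constant mirrors the 512K figure quoted above; `compute_dots` here is an illustration, not the `IvfCosine::compute_dots` implementation.

```rust
use std::thread;

/// Below this many multiplies (k * dim) the serial path is used.
/// Mirrors the 512K threshold quoted above; the exact value is an assumption.
const PARALLEL_THRESHOLD: usize = 512 * 1024;

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Dot product of the query against every centroid, one core.
pub fn compute_dots_serial(centroids: &[Vec<f32>], query: &[f32]) -> Vec<f32> {
    centroids.iter().map(|centroid| dot(centroid, query)).collect()
}

/// Same result as the serial path, split across `workers` threads when large enough.
pub fn compute_dots(centroids: &[Vec<f32>], query: &[f32], workers: usize) -> Vec<f32> {
    let k = centroids.len();
    if workers <= 1 || k * query.len() <= PARALLEL_THRESHOLD {
        return compute_dots_serial(centroids, query);
    }
    let chunk = (k + workers - 1) / workers; // ceil(k / workers)
    thread::scope(|scope| {
        let handles: Vec<_> = centroids
            .chunks(chunk)
            .map(|slice| scope.spawn(move || compute_dots_serial(slice, query)))
            .collect();
        // Joining in spawn order keeps the output in centroid order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

Each dot product is still accumulated in the same order on one thread, so the parallel result is bit-identical to the serial one; only the set of centroids is split. That is the property a determinism test such as parallel_compute_dots_matches_serial would check.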

B7. Manifest size at 10B.

Current: inline tracks paged when > 1 MiB; Manifest's tracks field could itself be paged but I haven't verified.
10B impact: with k=31K + 4 fields × ~120 B per Track entry, Manifest tracks list is ~15 KB. Inline-fine. Per-field paged Tracks have many leaf pages; Manifest is small either way.
Fix: probably no action needed; audit to confirm Manifest stays under 1 MiB at 10B.
Status: ✅ Verified (2026-05-18) — no action needed. Math: 10B records × 1 timeline × ~5 fields (image, embedding, label, +1-2 layered scalars). Manifest contains: (a) tracks field = 5 × ~150 B per TrackEntry = ~750 B; (b) registry = 5 modalities × ~100 B + schema CBOR (~1 KB) + dreamdb.tombstones head (~50 B) = ~1.5 KB; (c) parents = 1-N × 33 B. Total Manifest ≈ 2.5 KB at 10B records. Headroom factor: 400×. The data volume lives in PAGED Tracks/Buckets that the Manifest REFERENCES, not in the Manifest itself. Inline tracks paging (already supported per spec/0008 §4) would only matter for pathological cases (thousands of distinct modalities); not a 10B-scale concern in any normal workflow.

B8. No tombstones / deletion.

Current: append-only; deletion is a full data rewrite.
10B impact: any single GDPR request requires rewriting the entire dataset. Untenable.
Fix: tombstone primitive in spec/0020. Per-modality dreamdb.tombstones registry entry of (ordinal, anchor_hash) pairs. Query path skips matched records. GC eventually compacts.
Status: ✅ Shipped (2026-05-18). spec/0020 + TombstoneListObject (canonical CBOR, sorted u64 anchors, parent-DAG) + Dataset::delete + Dataset::tombstone_set + dreamdb delete CLI. Anchor granularity is Item-level (u64 TimeAnchor, the same key used by SpatialBucket records), so one tombstone suppresses every record across every modality — matches GDPR. Read-side filter wired into iter_with_fields (anchor-set retain before blob fetch) and iter_stream (per-record skip). Tombstone Object addressed at tombstones/<hash>. Manifest opts in via dreamdb.tombstones registry entry; absence means empty set (backward-compat). 9 new tests (6 protocol + 2 SDK + 1 address round-trip), 717 total green. Sub-anchor field-level tombstones + paged tombstone-lists + compact-tombstones operator deferred (spec/0020.1).
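The read-side filter reduces to a membership test against a sorted anchor list. A minimal in-memory sketch, assuming anchors are plain `u64` values; `TombstoneSet` and `retain_live` are hypothetical stand-ins and do not model the canonical CBOR encoding or the parent-DAG of the real TombstoneListObject.

```rust
/// Sorted, de-duplicated list of tombstoned time anchors.
pub struct TombstoneSet {
    anchors: Vec<u64>,
}

impl TombstoneSet {
    pub fn new(mut anchors: Vec<u64>) -> Self {
        anchors.sort_unstable();
        anchors.dedup();
        Self { anchors }
    }

    /// O(log T) membership test, where T is the number of tombstones.
    pub fn contains(&self, anchor: u64) -> bool {
        self.anchors.binary_search(&anchor).is_ok()
    }

    pub fn len(&self) -> usize {
        self.anchors.len()
    }
}

/// Anchor-set retain before blob fetch: drop every tombstoned anchor so that
/// no modality of a deleted Item is fetched or yielded.
pub fn retain_live(anchors: &mut Vec<u64>, tombstones: &TombstoneSet) {
    anchors.retain(|anchor| !tombstones.contains(*anchor));
}
```

A streaming reader would apply the same `contains` check per record instead of calling `retain_live` on a whole chunk.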

B9. Connector pool / HTTP/2 saturation.

Current: 32 idle connections per host, but effectively a single HTTP/2 connection (an earlier lsof showed 1 socket under buffer_unordered(16)).
10B impact: backend throughput is likely the bottleneck before the workers are. Needs multi-host fan-out or HTTP/2 stream tuning.
Fix: HTTP/3 connector, or shard across multiple backend endpoints (which the protocol allows — refs/buckets aren't endpoint-bound).
Status: ⏳ open

B10. Browser query path single-threaded JS.

Current: ADC scoring runs on the JS main thread.
10B impact: queries to a 10B-record dataset would freeze the browser tab. Not actually a production concern (the browser demo is illustrative; real apps use the Rust SDK), but worth noting.
Fix: Web Workers for ADC scoring. Out of scope for 10B-scale push.

🔵 P2 — works but should be improved

B11. Quantization drift on long-lived datasets.

The decode-rebuild-encode cycle on RaBitQ compounds quantization error. At 10B records over years of rebuilds, this is a real concern.
Fix: rerank=True mode (raw f32 stored alongside codes). Already shipped, but not used on production datasets.

B12. Per-batch HTTP overhead in append_many.

Current: ~N HTTP GET + N HTTP PUT per batch for N touched cells.
10B impact: with k bounded (Phase 2.2 merge step), each batch touches ~16 cells avg. Not the dominant cost.

Execution plan: 10B-blocker push

Order by leverage × bounded-LOC, NOT by what's most fun:

#	Item	LOC	Time	Why this order
1	B1 streaming iter ✅	~400	1-2 days	Unblocks every read-side workflow; OOM is a hard wall
2	B5 parallel per-anchor blob fetch ✅	~30	1 hr	Ship with B1; same diff area
3	B3 sharded redispatch ✅	~400	1-2 days	Maintenance becomes feasible at 10B
4	B4 prefix-sharded GC ✅	~300	1 day	Without this storage grows forever
5	B8 tombstones ✅	~400	2-3 days	GDPR-blocking; spec work needed
6	B2 sharded ingest ✅	~600	3-5 days	Last because it can use B1+B3+B4+B8 once they exist
7	B6 IMI / rayon hash_vector ✅	~100	half day	Optional; only needed if queries are slow
8	B7 verify Manifest size ✅	audit only	1 hr	Probably no work needed

Total: ~2200 LOC over ~9-14 days of focused work. After this, DreamDB is structurally ready for 10B-scale workloads.

What's already 10B-ready

These were today's wins that ALSO carry through to 10B:

✅ Chain-aware lineage (Phase 3.1): rebuilds at O(touched cells) not O(N)
✅ Cold-bucket skip (Phase 3.2): same
✅ Paged tracks (Phase 3.4): inline-vs-paged auto-decides; no manual ceremony
✅ Content-addressed storage: cross-rebuild dedup means 10B records can take 1.5-2× their raw byte count, not 10×
✅ Read-online during rebuilds: queries on the OLD Manifest aren't blocked
✅ Snapshot/branch: 33-byte Refs scale to billions per dataset without effort
✅ iter_with_fields: P1.0 fix delivered 8.5× speedup at 231K; the same proportional gain should carry to 10B
✅ Schema persistence in Manifest registry

The architecture is right. What remains is mechanical: every place we have an O(N) loop must become O(N/workers) or streaming. Every place we have a Vec<Foo> result must become a Stream<Item=Foo>.

Why no architectural changes are needed

The 10B-blocker list is striking for what's NOT on it: spec-level issues. No item requires a CBOR-shape change, no item requires revisiting lineage, no item requires a new Object type. Phase 3.1's chain-aware lineage was the load-bearing spec change; everything else is implementation work.

DreamDB's architectural framing is sound. The push to 10B is execution, not design.