Sharded Ingest via Branch+Union-Merge
Status (2026-05-18): this document is normative for the v0 reference implementation of
MergeStrategy::UnionTracks(the fused-merge approach). The protocol-level framing — that layered-merge and fused-merge are both valid resolutions for multi-parent Manifests — is inspec/0008 §5.3. A future revision MAY promote the algorithm here to its own spec number.
2026-05-18. Implementing B2 from design/0006-10b-scale-blockers.md. Other six blockers (B1, B5, B3, B4, B8, B6) have shipped; B2 is the last execution-side blocker before 10B-scale is structurally ready. This doc captures the algorithm so the implementation slice can run end-to-end.
Problem
A single-process Dataset::append_many ingests ~120 CLIP samples/second (image encode dominates). 10B images at that rate = 99 days. 100B = 990 days. Unacceptable.
The fix is the standard data-parallel pattern: N workers each ingest a disjoint slice into a personal branch; orchestrator merges branches back into trunk. The DreamDB Protocol's content-addressing makes this safe (no double-write conflicts) but it doesn't free the operator from implementing the merge correctly.
Why FastForward isn't sufficient
The existing Dataset::merge(branch, MergeStrategy::FastForward) works when the branch is a strict descendant of trunk. For sharded ingest:
After fork, every branch IS a descendant of trunk@T0 → first FF works:
But now trunk is at T_w0; w1's parent is T0, not T_w0. The second FF fails because w1 is NOT a descendant of T_w0 (it's a sibling). This is the diverged-history case that requires real merge.
The merge algorithm (3-way union, single branch)
Given:
- Trunk's current Manifest
T_self. - Branch's current Manifest
T_other. - LCA (lowest common ancestor)
T_lca, found by walkingparents[0]on both sides and finding the first hash in both chains.
For each TrackEntry in each Manifest's tracks:
- Find the corresponding TrackEntry in
T_self,T_other,T_lca(same(timeline, modality)key). - Refuse if any modality is present in one side and absent in the other (schema divergence — out of scope for v0).
- Refuse if the modality's SpatialIndex hash differs across the three (catastrophic per
project_collab_disciplines.md). - Walk the three Tracks'
object_index:
For SpatialBucket tracks:
- Build
cells: HashMap<SpatialKey, (lca_entry, self_entry, other_entry)>from the union of all three. - For each cell key
k:- If
self_entry.bucket_address == lca_entry.bucket_addressANDother_entry.bucket_address == lca_entry.bucket_address→ cell unchanged everywhere. Use LCA's entry. - If
self_entry.bucket_address == lca_entry.bucket_addressonly → branch touched this cell. Useother_entry. - If
other_entry.bucket_address == lca_entry.bucket_addressonly → trunk touched this cell. Useself_entry. - Both differ from LCA → merge buckets (see below).
- If
- Emit a new TrackEntry with the merged cells.
For Fragment tracks:
- Concatenate entries from
self.entries.diff(lca)andother.entries.diff(lca). Sort byt_start. No record-level merge needed — fragments are independent.
For ScalarBucket tracks:
- Per scalar value, union the anchor lists from each side's bucket; write a new bucket if both sides modified the same value's bucket; otherwise use whichever side touched it.
For Constant tracks:
- Refuse if the Constant Objects differ (operator must pick a winner explicitly).
Bucket-merge subroutine (the actually-novel piece)
For a cell k modified on both sides:
- Fetch
bucket_self = SpatialBucket::from_bytes(connector.get(self_entry.bucket_address)?). - Fetch
bucket_other = SpatialBucket::from_bytes(connector.get(other_entry.bucket_address)?). - Fetch
bucket_lca = SpatialBucket::from_bytes(connector.get(lca_entry.bucket_address)?). - Reuse
append_many's existing consolidation logic (line ~1100-1250 ofdataset.rs):- Records in
bucket_self ∪ bucket_other(deduplicated bytime_anchor— equality means literally the same Item appended twice; refuse loudly because that means the slice assignment was wrong). - Write the merged bucket as a new Object.
- Records in
- Build new SpatialBucketEntry with the merged bucket's hash.
Manifest construction
Multi-parent parents is a single load-bearing CBOR change — already supported by the Manifest decoder (see spec/0008).
Refusal scenarios
The v0 union-merge MUST refuse loudly (not silently re-encode) in these cases:
- Modality present in one Manifest but absent in another.
- SpatialIndex hash differs across (self, other, lca) for any modality.
- A bucket-merge subroutine finds two records with the same
time_anchor(slice-assignment was wrong). - The LCA walk doesn't terminate within 1000 steps in either direction.
- A non-SpatialBucket Track Object has diverged in an unsupported way (e.g., Constant differs).
Multi-branch merge: Dataset::merge_many(branches: &[&str])
For N parallel ingest branches, the operator wants one atomic merge that captures all N at once. The single-branch algorithm above generalizes:
- LCA: must be common across all (N+1) tips. Walk every chain; intersect ancestor sets.
- Per cell, the conflict set is
{branch_i : branch_i_bucket_addr ≠ lca_bucket_addr}. Merge all buckets in the conflict set in one shot. - Manifest:
parents: vec![T_self, T_b0, T_b1, ..., T_b{N-1}].
Cost analysis at 10B records / N=64 branches:
- LCA walk: 64 chain walks × ~3 steps each = trivial.
- Per cell: average N=10 branches modify it. 16.7M cells × 10 GETs = 167M GETs. With
buffer_unordered(256)and ~5ms/GET = ~57 minutes. - Total wall-clock: ~1 hour for the merge phase, dwarfed by the 6+ days of parallel CLIP inference it enables.
This is the actual 10B-scale benefit: ingest goes from 99 days serial to ~3.5 days parallel (N=64) + 1 hour merge.
Operator pattern (k8s)
The merge orchestrator is single-pod (one CAS at the end). The 1 hour merge time is acceptable for a 3.5-day ingest cycle.
Implementation slice (shipped 2026-05-18)
Protocol formalization (the layered-vs-fused distinction) lives in spec/0008 §5.3. The full algorithm above is normative for v0 implementations of fused-merge; a future revision MAY promote it to its own spec number.
Status
Designed + implemented (2026-05-18). Shipped:
MergeStrategy::UnionTracksindreamdb-dataset— single-branch 3-way union-merge with the algorithm described above. Refusal scenarios are wired (1, 2, 3 above; 4 is bounded at 1000-step LCA walk; 5 is the paged/non-SpatialBucket case).Dataset::merge_many(branches: &[&str])— sequential wrapper.dreamdb merge-many --backend ... --ref-name <trunk> <branches>...CLI.- 3 integration tests: disjoint-branch combine, no-op same-tip, three-branch chain.
- Bug fix discovered en route: post-FF
field_trackswasn't being refreshed → reads returned stale pre-merge results.Dataset::refresh_field_tracks_from_current()now runs after both FF and UnionTracks.
What remains as follow-up (not on the 10B-blocker critical path):
- Single-Manifest
(N+1)-parent batched merge (one fewer Manifest + CAS per merge run; sequential is fine for now). - Fragment / Scalar / Constant union-merge (currently refuses non-embedding diverged tracks).
- Promotion of this design doc to a full
spec/00XX(the algorithm is already normative; just hasn't been assigned a spec number). - k8s YAML example in
dreamdb-cli/examples/sharded-ingest.yaml.