Design 0008: Fragment Packs — bundle small items per S3 object
Status: PROMOTED to spec/0022-fragment-packs.md (2026-05-22). This design doc is kept for historical context and rationale; the normative protocol is in spec/0022.
Context
The 1M imagenet S3 ingest benchmark (2026-05-22) revealed that small-file Fragment PUTs are the dominant ingest bottleneck for laptop → us-east-1 workloads:
- Network RTT: 55 ms
- 1024 × 45 KB Fragment PUTs at 64-way concurrency: ~18-22 s per batch
- AWS CLI baseline: 51 files/s (identical to our connector) — confirms it's a network/protocol limit, not a code bug
- CLIP-MPS encode rate: 192/s (1024 in 5.3 s) — not the bottleneck
End-to-end rate is ~80 samples/s, dominated by per-PUT overhead (TCP/TLS handshake + HTTP framing + RTT) rather than bandwidth. At 16 KB/image × 80/s = 1.3 MB/s effective; LAN capacity is ~7 MB/s. Most of the wire time is overhead, not payload.
Three observations make a protocol-level fix the right move (per the user's directive 2026-05-22):
- Code-level concurrency knobs are exhausted. PARALLEL_FRAGMENTS=256 makes throughput worse (S3 connection-closed errors at the higher rate). The network is saturated at our current burst pattern.
- The dominant cost is per-PUT overhead, not per-byte transfer. Packing N items into one S3 PUT amortizes the overhead by N×.
- The image-heavy ML training workload is recurring: every CLIP-bound corpus ingest hits this wall. A protocol primitive is more durable than a one-off pipeline trick.
The goal: enable ingest scripts to write multiple small items into a single S3 object (a "Fragment Pack"), reducing PUT count from N to N/pack_size while preserving content-addressing and per-item addressability.
What changes architecturally
A new Object kind, FragmentPack: a single S3 object containing K Fragment items packed contiguously, plus a small in-Object index of (t_start, size) per packed item. Each item is independently addressable via a (pack_hash, item_index) pair or via a byte-range fetch directly from the pack.
The Track Object's object_index references FragmentPacks instead of (or alongside) individual Fragments. Each FragmentEntry carries:
- t_start, t_end: same as today
- byte_size: total pack size, not per-item
- address: content hash of the pack
- A new field, items: Vec<PackItemEntry>, listing the per-item (t_anchor, offset, size) tuples within the pack
Or, equivalently, with less Track Object growth: the Track Object holds one entry per pack with the time range of the pack as a whole; the per-item index lives inside the pack header (read at fetch time).
Two encoding designs to choose between
Option A: External index (per-item in Track Object)
The pack itself is just [item_0_bytes][item_1_bytes]...[item_N-1_bytes] with no internal index.
Pros: Reading one item is a single ranged-GET (no extra round-trip to read pack index).
Cons: The Track Object inflates linearly with item count: the per-item array is at least 32 bytes per entry, so 1M items means roughly 32 MB or more of Track Object. Paged-track support already exists (per project_paged_indexes.md), so this scales, but it adds metadata overhead.
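With an external index, the Track Object entries are the only description of the pack layout, so a writer or verifier would probably want to confirm that they tile the pack exactly. A minimal sketch of that check follows. The `PackItemEntry` shape mirrors the (t_anchor, offset, size) tuple above; `check_tiles_pack` and the integer timestamp type are assumptions made for illustration, not part of the spec.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PackItemEntry:
    t_anchor: int  # item timestamp (integer units are an assumption)
    offset: int    # byte offset of the item inside the pack
    size: int      # item length in bytes


def check_tiles_pack(items: List[PackItemEntry], pack_byte_size: int) -> None:
    """Raise ValueError unless the items cover [0, pack_byte_size) with no gaps or overlaps."""
    expected_offset = 0
    for i, item in enumerate(items):
        if item.offset != expected_offset:
            raise ValueError(f"item {i}: offset {item.offset}, expected {expected_offset}")
        expected_offset += item.size
    if expected_offset != pack_byte_size:
        raise ValueError(f"items cover {expected_offset} bytes, pack is {pack_byte_size}")


good = [PackItemEntry(1000, 0, 16), PackItemEntry(1001, 16, 20), PackItemEntry(1002, 36, 12)]
check_tiles_pack(good, 48)  # passes silently
```

The check only applies to the header-less layout of Option A; a pack with an internal header would have byte_size larger than the sum of the item sizes.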
Option B: Internal index (pack header)
The Track Object stays compact (one entry per pack). The per-item index lives inside the pack header.
Pros: Track Object stays small. The pack header is self-describing: anyone with the pack hash can enumerate its items without consulting the Track.
Cons: Reading one item is GET pack_header → byte-range GET item bytes, i.e. 2 round trips. An HTTP multi-range GET cannot collapse this to 1 round trip on S3, because S3 does not support retrieving multiple ranges in a single GET request; the realistic way to reach 1 round trip is a cached header.
Recommendation: Option B. The 2-round-trip cost is paid only on individual item lookups, which are rare in practice (training loops use streaming iter_* which reads sequential packs in full). Pack-header caching at the SDK layer collapses the 2-RTT to 1-RTT for hot packs.
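The header-caching idea in the recommendation can be sketched as a small LRU keyed by pack hash. This is only an illustration: `PackHeaderCache`, its `fetch_header` callback, and the decoded index shape are hypothetical names, and the real SDK cache may look quite different.

```python
from collections import OrderedDict
from typing import Callable, List, Tuple

Index = List[Tuple[int, int, int]]  # (t_anchor, offset, size) per item


class PackHeaderCache:
    """LRU cache of decoded pack headers, keyed by pack hash."""

    def __init__(self, fetch_header: Callable[[str], Index], capacity: int = 1024):
        self._fetch = fetch_header          # performs the header GET on a miss
        self._capacity = capacity
        self._entries: "OrderedDict[str, Index]" = OrderedDict()
        self.misses = 0                     # each miss costs one extra round trip

    def get(self, pack_hash: str) -> Index:
        if pack_hash in self._entries:
            self._entries.move_to_end(pack_hash)  # mark as most recently used
            return self._entries[pack_hash]
        self.misses += 1
        index = self._fetch(pack_hash)
        self._entries[pack_hash] = index
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)     # evict the least recently used
        return index


# In-memory stand-in for the object store, for demonstration only.
fake_headers = {
    "pack-a": [(1000, 0, 16), (1001, 16, 20)],
    "pack-b": [(2000, 0, 8)],
    "pack-c": [(3000, 0, 8)],
}
cache = PackHeaderCache(fake_headers.__getitem__, capacity=2)
cache.get("pack-a")
cache.get("pack-a")  # served from cache, no second header fetch
```

Because packs are content-addressed, a cached header never goes stale for a given hash, so no invalidation logic is needed beyond eviction.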
Address format
New DreamDBAddress variants:
Path: <timeline>/<modality>/<time-bucket>/pack/<hash>
Items within a pack are addressed by (pack_hash, item_index) or (pack_hash, byte_range). No new per-item address variant is needed; item addressing goes through the Track Object's per-item index (option A) or the pack header (option B).
Pack header format (option B chosen)
Bytes after the header are the raw item payloads concatenated.
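The header definition itself is not reproduced in this section, and the normative encoding is the CBOR format in spec/0022. Purely to illustrate the "header, then concatenated payloads" layout, the sketch below uses a stand-in fixed-width header (item count, then one (t_anchor, offset, size) record per item). The field widths, the payload-relative offsets, and the function names are all assumptions and do not describe the real wire format.

```python
import struct
from typing import Iterator, List, Tuple

# Stand-in layout (NOT the spec's CBOR header): little-endian, no padding.
_COUNT = struct.Struct("<I")    # number of items
_ENTRY = struct.Struct("<qQI")  # t_anchor, offset (relative to payload start), size


def build_pack(items: List[Tuple[int, bytes]]) -> bytes:
    """Serialize (t_anchor, payload) pairs as header + concatenated payloads."""
    parts = [_COUNT.pack(len(items))]
    offset = 0
    for t_anchor, payload in items:
        parts.append(_ENTRY.pack(t_anchor, offset, len(payload)))
        offset += len(payload)
    parts.extend(payload for _, payload in items)
    return b"".join(parts)


def parse_header(pack: bytes) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Return (header_len, [(t_anchor, offset, size), ...])."""
    (count,) = _COUNT.unpack_from(pack, 0)
    entries = [
        _ENTRY.unpack_from(pack, _COUNT.size + i * _ENTRY.size) for i in range(count)
    ]
    return _COUNT.size + count * _ENTRY.size, entries


def iter_items(pack: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (t_anchor, payload) in pack order, as a sequential reader would."""
    header_len, entries = parse_header(pack)
    for t_anchor, offset, size in entries:
        start = header_len + offset
        yield t_anchor, pack[start:start + size]


pack = build_pack([(1000, b"aaa"), (1001, b"bbbbb")])
items = list(iter_items(pack))  # round-trips the two payloads
```

Storing offsets relative to the payload start keeps them independent of the header length; a real format could equally store absolute offsets.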
Write-path changes (dreamdb-dataset/src/dataset/append.rs)
append_many already groups image/video items per batch. The new flow:
- After all items are CLIP-encoded and grouped, bucket items by time-window (e.g. records arriving within the same 10s window) into packs of up to pack_max_items (e.g. 32).
- For each pack: build a PackHeader, concatenate item bytes after the header, content-hash the whole thing, PUT once.
- Emit one FragmentEntry per pack in the Track Object (or one with embedded per-item index if option A).
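The bucketing step in the first bullet can be sketched as follows. This is a hedged illustration rather than the code in append.rs: `bucket_into_packs` and the (arrival time, payload) item shape are hypothetical, and header construction, hashing, and the PUT are left out.

```python
from typing import Iterable, List, Optional, Tuple

Item = Tuple[float, bytes]  # (arrival time in seconds, payload); shape is an assumption


def bucket_into_packs(
    items: Iterable[Item], window_s: float = 10.0, pack_max_items: int = 32
) -> List[List[Item]]:
    """Group items by time window, splitting a window into several packs when it is full."""
    packs: List[List[Item]] = []
    current: List[Item] = []
    current_window: Optional[int] = None
    for item in sorted(items, key=lambda it: it[0]):
        window = int(item[0] // window_s)
        # Close the open pack when the window changes or the pack is full.
        if current and (window != current_window or len(current) >= pack_max_items):
            packs.append(current)
            current = []
        current.append(item)
        current_window = window
    if current:
        packs.append(current)
    return packs


records = [(0.0, b"a"), (1.0, b"b"), (2.0, b"c"), (3.0, b"d"), (11.0, b"e")]
packs = bucket_into_packs(records, window_s=10.0, pack_max_items=3)
sizes = [len(p) for p in packs]  # [3, 1, 1]
```

Note that window boundaries and the item cap can both leave small trailing packs, including packs of a single item, which matters for any reader that distinguishes packs from plain Fragments by item count.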
Pack size tuning:
- Too small (1-2 items): no overhead amortization, same as today.
- Too large (1000+ items): single-item GETs pay for unrelated payload (waste).
- Sweet spot: 32-128 items per pack, sized so total pack ≤ 2-5 MB.
At 32 items × 16 KB image = 512 KB per pack → S3 PUT of 512 KB instead of 32 × 16 KB PUTs. Per-PUT overhead amortized 32×.
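One way to turn the tuning guidance into a rule is to cap the pack by bytes first and by item count second. The helper below is a sketch under assumed defaults: a 4 MiB byte cap (in the spirit of the 2-5 MB range above) and a 128-item cap. The function name and both defaults are illustrative, not settled values.

```python
def choose_pack_items(
    item_bytes: int,
    target_pack_bytes: int = 4 * 1024 * 1024,
    max_items: int = 128,
) -> int:
    """Items per pack: as many as fit under the byte cap, at most max_items, at least 1."""
    if item_bytes <= 0:
        raise ValueError("item_bytes must be positive")
    fits = target_pack_bytes // item_bytes
    return max(1, min(max_items, fits))


small = choose_pack_items(16 * 1024)        # item cap wins: 128 items, 2 MiB pack
medium = choose_pack_items(45 * 1024)       # byte cap wins: 91 items
large = choose_pack_items(8 * 1024 * 1024)  # oversized item gets a pack of its own
```

With this rule, small images hit the item cap while larger items are limited by bytes, so single-item ranged reads never sit inside an arbitrarily large object.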
Read-path changes (dreamdb-dataset/src/dataset/iter.rs, dreamdb-protocol/src/verbs/query_vector.rs)
For iter_stream (sequential read):
- Walk the Track Object's per-pack FragmentEntrys in time order.
- For each pack, GET the whole pack object once.
- Decode the header, iterate items in order, yield each as a record.
For iter_vector (top-K query):
- The vector path still uses the embedding modality (Spatial Buckets), unchanged.
- When the result set's anchors map to image items, look up the corresponding pack(s) via the Track Object.
- Fetch each needed pack (cache-aware), extract the relevant items by their (offset, size) from the header.
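For the top-K path, the useful property is that each pack is fetched at most once per result set, however many anchors land in it. A minimal sketch of that grouping, assuming the Track Object yields non-overlapping (t_start, t_end, pack_hash) entries sorted by t_start with an inclusive t_end (the tuple shape and function name are hypothetical):

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

PackRef = Tuple[int, int, str]  # (t_start, t_end inclusive, pack_hash)


def group_anchors_by_pack(
    anchors: Sequence[int], packs: Sequence[PackRef]
) -> Dict[str, List[int]]:
    """Map each result anchor to the pack whose time range covers it."""
    starts = [p[0] for p in packs]
    grouped: Dict[str, List[int]] = defaultdict(list)
    for anchor in anchors:
        i = bisect_right(starts, anchor) - 1  # last pack starting at or before anchor
        if i < 0 or anchor > packs[i][1]:
            raise KeyError(f"no pack covers anchor {anchor}")
        grouped[packs[i][2]].append(anchor)
    return dict(grouped)


track = [(0, 9, "pack-a"), (10, 19, "pack-b"), (20, 29, "pack-c")]
needed = group_anchors_by_pack([3, 25, 7, 21], track)
# Two packs to fetch instead of four item GETs.
```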
For single-anchor lookups:
- GET pack header (or use cached header).
- Locate (offset, size) for the target anchor.
- GET pack [offset..offset+size] via Range header.
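One detail worth spelling out for the last step: HTTP byte ranges are inclusive at both ends, so the half-open [offset..offset+size] span has to be converted before it goes into the Range header. The sketch below assumes offsets are relative to the start of the payload area (so the header length is added) and uses a hypothetical `item_range_header` helper; if the real header stores absolute offsets, pass a header length of 0.

```python
from typing import Dict, List, Tuple


def item_range_header(
    header_len: int, index: List[Tuple[int, int, int]], anchor: int
) -> Dict[str, str]:
    """Build the Range header for the item whose t_anchor equals anchor."""
    for t_anchor, offset, size in index:
        if t_anchor == anchor:
            if size <= 0:
                raise ValueError("cannot range-read an empty item")
            first = header_len + offset
            last = first + size - 1  # inclusive end, per HTTP range semantics
            return {"Range": f"bytes={first}-{last}"}
    raise KeyError(f"anchor {anchor} not in pack")


index = [(1000, 0, 3), (1001, 3, 5)]
headers = item_range_header(44, index, 1001)  # {'Range': 'bytes=47-51'}
```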
Backward compatibility
Existing Datasets with per-item FragmentEntrys (no pack) keep working — the read path detects pack vs non-pack via a flag in the FragmentEntry (e.g. pack_item_count > 1 means packed).
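The detection rule can be sketched in a few lines. Here a FragmentEntry is modeled as a plain mapping and legacy entries are assumed to simply lack the field; both choices, and the `is_packed` name, are illustrative assumptions.

```python
from typing import Mapping


def is_packed(entry: Mapping[str, object]) -> bool:
    """Apply the 'pack_item_count > 1 means packed' rule; a missing field means legacy."""
    count = entry.get("pack_item_count", 1)
    return isinstance(count, int) and count > 1


legacy = {"address": "<fragment-hash>"}
packed = {"address": "<pack-hash>", "pack_item_count": 32}
single = {"address": "<pack-hash>", "pack_item_count": 1}

kinds = [is_packed(legacy), is_packed(packed), is_packed(single)]  # [False, True, False]
```

The third case is the one to watch: under this rule a pack that happens to hold a single item is indistinguishable from a plain Fragment, so such items would either need to be written as plain Fragments or be flagged some other way.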
New writers can opt into packing per modality:
pack_items=None (default) writes one item per Fragment Object (today's behavior). pack_items=32 packs 32 items per Object.
Spec contribution
A new spec/0022-fragment-packs.md defines:
- The pack header CBOR format
- The Track Object's FragmentEntry extensions for packed/unpacked
- The new DreamDBAddress::FragmentPack variant
- Read path requirements (sequential decode, per-item byte-range)
- Conformance test vectors
Estimated effort
- Spec doc: 0.5 days
- Pack header type + encode/decode in dreamdb-protocol: 0.5 days
- Write path in append.rs: 1 day
- Read path in iter.rs + query_vector.rs: 1.5 days
- Python SDK exposure (pack_items schema param): 0.5 days
- Tests + benchmarks: 1 day
- Total: 4-5 days (matches the original estimate)
Expected payoff (from the 2026-05-22 measurements)
At 32 items per pack, 1M imagenet ingest:
- PUT count: 31,250 packs instead of 1M Fragments (32× reduction)
- Per-pack PUT: ~512 KB at ~5 MB/s = 100 ms each (more bandwidth-bound than RTT-bound)
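- 31,250 / 64 concurrency × 100 ms = ~50 s of upload work, assuming each of the 64 concurrent PUTs sustains ~5 MB/s. If ~5 MB/s is instead the aggregate uplink, the ~16 GB payload takes roughly 53 minutes (16 GB / 5 MB/s), which is still below the CLIP time.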
- CLIP encoding for 1M: ~85 minutes at 192/s
- End-to-end ETA: ~85 min (CLIP-bound, network amortized away)
Plus pipelining (already shipped 2026-05-22 evening): potentially reduces CLIP wall too via overlap.
Combined with EC2 us-east-1: ~10-15 minutes for 1M.
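The per-pack arithmetic in the list above can be reproduced with a few lines, which makes it easy to re-run with other pack sizes. The inputs are the figures quoted in this document (decimal KB and MB); the variable names are illustrative, and the last value is an extra comparison that treats ~5 MB/s as the aggregate link rate rather than a per-connection rate.

```python
import math

n_items = 1_000_000
pack_items = 32
item_bytes = 16_000              # 16 KB per image
upload_bytes_per_s = 5_000_000   # ~5 MB/s
concurrency = 64
clip_items_per_s = 192

n_packs = math.ceil(n_items / pack_items)            # 31,250 PUTs
per_pack_s = pack_items * item_bytes / upload_bytes_per_s
upload_s = n_packs / concurrency * per_pack_s        # per-connection bandwidth model
clip_min = n_items / clip_items_per_s / 60           # CLIP encode time in minutes

# Comparison: all payload bytes share one ~5 MB/s link.
aggregate_upload_min = n_items * item_bytes / upload_bytes_per_s / 60

print(f"{n_packs} packs, {per_pack_s * 1000:.0f} ms each, {upload_s:.0f} s upload")
print(f"CLIP {clip_min:.0f} min, aggregate-link upload {aggregate_upload_min:.0f} min")
```

Under either bandwidth model the upload finishes before CLIP encoding does (about 87 minutes at 192/s), so the CLIP-bound conclusion holds as long as upload overlaps with encoding.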
When to ship
After the 10K + 100K pipelined baseline runs are stable. The image-pack work is a next slice, not blocking the next 1M test. Ship after a green 100K verifies the pipeline overlap holds at intermediate scale.
Open questions
| OQ | Description |
|---|---|
| OQ-50 | Should pack_items be a runtime knob, schema-pinned, or backend-config? |
| OQ-51 | How to handle mid-stream pack_items change — backward-compat policy? |
| OQ-52 | Optimal pack size as function of item size? (Probably target_pack_bytes ≤ 4 MB.) |
| OQ-53 | Multi-pack consolidation analogous to dreamdb-cli compact for packs that became sparse from deletions? |
Defer all four to first-implementation pass with empirical data.