DreamDB v0.2.0 bec026

Design 0008: Fragment Packs — bundle small items per S3 object

Status: PROMOTED to spec/0022-fragment-packs.md (2026-05-22). This design doc is kept for historical context and rationale; the normative protocol is in spec/0022.

Context

The 1M imagenet S3 ingest benchmark (2026-05-22) revealed that small-file Fragment PUTs are the dominant ingest bottleneck for laptop → us-east-1 workloads:

  • Network RTT: 55 ms
  • 1024 × 45 KB Fragment PUTs at 64-way concurrency: ~18-22 s per batch
  • AWS CLI baseline: 51 files/s (identical to our connector) — confirms it's a network/protocol limit, not a code bug
  • CLIP-MPS encode rate: 192/s (1024 in 5.3 s) — not the bottleneck

End-to-end rate is ~80 samples/s, dominated by per-PUT overhead (TCP/TLS handshake + HTTP framing + RTT) rather than bandwidth. At 16 KB/image × 80/s = 1.3 MB/s effective; LAN capacity is ~7 MB/s. Most of the wire time is overhead, not payload.
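The arithmetic behind these figures can be checked in a few lines. The sketch below only restates the numbers quoted above; the variable names are illustrative and not part of any DreamDB tooling.

```python
# Figures quoted in the benchmark notes above; names are illustrative only.
batch_puts = 1024              # Fragment PUTs per batch
batch_seconds = (18.0, 22.0)   # observed wall time per batch (fast, slow)
image_kb = 16                  # payload per image used in the estimate
end_to_end_rate = 80           # samples per second
link_mb_per_s = 7.0            # quoted link capacity

# PUT rate implied by the batch timing (slow batch first, fast batch second).
put_rate = tuple(batch_puts / s for s in reversed(batch_seconds))

# Effective payload throughput at the end-to-end rate, in MB/s.
effective_mb_per_s = image_kb * end_to_end_rate / 1000

# Fraction of the quoted link capacity that actually carries payload.
utilisation = effective_mb_per_s / link_mb_per_s

print(f"PUT rate: {put_rate[0]:.0f}-{put_rate[1]:.0f} per second")
print(f"effective: {effective_mb_per_s:.2f} MB/s ({utilisation:.0%} of link)")
```

The AWS CLI baseline of 51 files/s falls inside the PUT-rate range implied by the batch timing, and payload accounts for under a fifth of the quoted link capacity.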

Three observations make a protocol-level fix the right move (per the user's directive 2026-05-22):

  1. Code-level concurrency knobs are exhausted. PARALLEL_FRAGMENTS=256 makes throughput worse (S3 connection-closed errors at the higher rate). The network is saturated at our current burst pattern.
  2. The dominant cost is per-PUT overhead, not per-byte transfer. Packing N items into one S3 PUT amortizes the overhead by N×.
  3. The image-heavy ML training workload is recurring — every CLIP-bound corpus ingest hits this wall. A protocol primitive is more durable than a one-off pipeline trick.

The goal: enable ingest scripts to write multiple small items into a single S3 object (a "Fragment Pack"), reducing PUT count from N to N/pack_size while preserving content-addressing and per-item addressability.

What changes architecturally

A new Object kind, FragmentPack: a single S3 object containing K Fragment items packed contiguously, plus a small in-Object index of (t_anchor, offset, size) per packed item. Each item is independently addressable via a (pack_hash, item_index) pair or via a byte-range fetch directly from the pack.

The Track Object's object_index references FragmentPacks instead of (or alongside) individual Fragments. Each FragmentEntry carries:

  • t_start, t_end — same as today
  • byte_size — total pack size, not per-item
  • address — content hash of the pack
  • A new field: items: Vec<PackItemEntry> listing the per-item (t_anchor, offset, size) tuples within the pack

Or, equivalently, with less Track Object growth: the Track Object holds one entry per pack with the time range of the pack as a whole; the per-item index lives inside the pack header (read at fetch time).

Two encoding designs to choose between

Option A: External index (per-item in Track Object)

Track Object:
  object_index: [
    FragmentEntry {
      t_start, t_end, byte_size, pack_hash,
      items: [
        { t_anchor, offset_in_pack, size_in_pack },
        ...
      ]
    },
    ...
  ]

The pack itself is just [item_0_bytes][item_1_bytes]...[item_N-1_bytes] with no internal index.

Pros: Reading one item is a single ranged GET (no extra round trip to read a pack index).

Cons: The Track Object grows linearly with item count. The per-item array is at least 32 bytes per entry, so 1M items add at least 32 MB (about 30 MiB) to the Track Object. Paged-track support already exists (per project_paged_indexes.md), so this scales, but it adds metadata overhead.

Option B: Internal index (pack header)

S3 pack object:
  [PackHeader: {item_count, [(t_anchor_i, offset_i, size_i)]}]
  [item_0_bytes][item_1_bytes]...

Track Object:
  object_index: [
    FragmentEntry { t_start, t_end, pack_byte_size, pack_hash }
    ...
  ]

The Track Object stays compact (one entry per pack). The per-item index lives inside the pack header.

Pros: Track Object stays small. The pack header is self-describing: anyone with the pack hash can enumerate its items without consulting the Track.

Cons: Reading one item is GET pack_header → byte-range GET item bytes, i.e. 2 round trips. S3 does not support multiple byte ranges in a single GET request, so the two fetches cannot be merged into one multipart-range request.

Recommendation: Option B. The 2-round-trip cost is paid only on individual item lookups, which are rare in practice (training loops use streaming iter_* which reads sequential packs in full). Pack-header caching at the SDK layer collapses the 2-RTT to 1-RTT for hot packs.
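A minimal sketch of the header caching idea, assuming a small LRU keyed by pack hash. `HeaderCache` and `fake_fetch` are hypothetical names for illustration and are not the SDK's API; the counter only shows how many header GETs would actually reach S3.

```python
from collections import OrderedDict


class HeaderCache:
    """Tiny LRU cache keyed by pack hash (hypothetical helper)."""

    def __init__(self, fetch_header, capacity=1024):
        self._fetch_header = fetch_header   # callable: pack_hash -> header
        self._capacity = capacity
        self._entries = OrderedDict()
        self.round_trips = 0                # header GETs actually issued

    def get(self, pack_hash):
        if pack_hash in self._entries:
            self._entries.move_to_end(pack_hash)    # mark as most recently used
            return self._entries[pack_hash]
        self.round_trips += 1
        header = self._fetch_header(pack_hash)
        self._entries[pack_hash] = header
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)       # evict least recently used
        return header


def fake_fetch(pack_hash):
    """Stand-in for the remote header GET."""
    return {"pack": pack_hash, "items": []}


cache = HeaderCache(fake_fetch, capacity=2)
for pack_hash in ["p1", "p1", "p2", "p1", "p3", "p2"]:
    cache.get(pack_hash)
print(cache.round_trips)
```

With a cached header, a single-item lookup needs only the ranged GET for the item bytes; cold packs still pay both round trips.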

Address format

New DreamDBAddress variant:

```rust
DreamDBAddress::FragmentPack {
    timeline: Multihash,
    modality: ModalityTag,
    time_bucket: u64,
    hash: Multihash,
}
```

Path: <timeline>/<modality>/<time-bucket>/pack/<hash>

Items within a pack are addressed by (pack_hash, item_index) or (pack_hash, byte_range). No separate item-level address variant is introduced; items are resolved through the Track Object's per-item index (option A) or the pack header (option B).

Pack header format (option B chosen)

PackHeader (CBOR):
  [
    "dreamdb.pack/1",          # magic + version
    item_count: u32,
    items: [
      { t_anchor: u64, offset: u32, size: u32 },
      ...
    ]
  ]

Bytes after the header are the raw item payloads concatenated.
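A minimal sketch of the header layout above, using a hand-rolled encoder for the small CBOR subset it needs (unsigned integers, text strings, arrays, maps, per RFC 8949). Two assumptions are not stated in the design: offsets are taken relative to the first payload byte, and the header length is discovered by decoding the self-delimiting CBOR value. `build_pack` and `read_pack` are hypothetical names.

```python
import struct

MAGIC = "dreamdb.pack/1"


def _head(major, n):
    """CBOR initial byte plus argument (RFC 8949, section 3)."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if n < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | info]) + struct.pack(fmt, n)
    raise ValueError("value does not fit in 64 bits")


def cbor_encode(value):
    """Encode unsigned ints, text strings, lists and dicts only."""
    if isinstance(value, int) and value >= 0:
        return _head(0, value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, list):
        return _head(4, len(value)) + b"".join(cbor_encode(v) for v in value)
    if isinstance(value, dict):
        body = b"".join(cbor_encode(k) + cbor_encode(v) for k, v in value.items())
        return _head(5, len(value)) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


def cbor_decode(buf, pos=0):
    """Decode one value; return (value, position after it)."""
    major, info = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if info < 24:
        n = info
    elif info <= 27:
        width = 1 << (info - 24)            # 1, 2, 4 or 8 argument bytes
        n = int.from_bytes(buf[pos:pos + width], "big")
        pos += width
    else:
        raise ValueError("indefinite lengths are not supported in this sketch")
    if major == 0:
        return n, pos
    if major == 3:
        return buf[pos:pos + n].decode("utf-8"), pos + n
    if major == 4:
        out = []
        for _ in range(n):
            item, pos = cbor_decode(buf, pos)
            out.append(item)
        return out, pos
    if major == 5:
        out = {}
        for _ in range(n):
            key, pos = cbor_decode(buf, pos)
            out[key], pos = cbor_decode(buf, pos)
        return out, pos
    raise ValueError(f"unsupported major type {major}")


def build_pack(items):
    """items: list of (t_anchor, payload). Offsets are relative to the payload
    region (an assumption of this sketch)."""
    index, offset = [], 0
    for t_anchor, payload in items:
        index.append({"t_anchor": t_anchor, "offset": offset, "size": len(payload)})
        offset += len(payload)
    return cbor_encode([MAGIC, len(items), index]) + b"".join(p for _, p in items)


def read_pack(pack):
    """Yield (t_anchor, payload) for every item, in header order."""
    (magic, count, index), base = cbor_decode(pack)
    if magic != MAGIC or count != len(index):
        raise ValueError("not a dreamdb.pack/1 header")
    for entry in index:
        start = base + entry["offset"]
        yield entry["t_anchor"], pack[start:start + entry["size"]]


pack = build_pack([(100, b"jpeg-0"), (105, b"jpeg-one")])
print(list(read_pack(pack)))
```

Making offsets relative to the payload region avoids a circular dependency between header length and offset values; a real reader fetching only the header would still need a way to bound its size, which this sketch leaves open.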

Write-path changes (dreamdb-dataset/src/dataset/append.rs)

append_many already groups image/video items per batch. The new flow:

  1. After all items are CLIP-encoded and grouped, bucket items by time-window (e.g. records arriving within the same 10s window) into packs of up to pack_max_items (e.g. 32).
  2. For each pack: build a PackHeader, concatenate item bytes after the header, content-hash the whole thing, PUT once.
  3. Emit one FragmentEntry per pack in the Track Object (or one with embedded per-item index if option A).
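The steps above can be sketched as follows, assuming items arrive sorted by t_anchor (in seconds) and that a pack never spans two time windows. `group_into_packs` and `fragment_entry` are hypothetical helpers; SHA-256 stands in for the Multihash the real address uses, and header serialization is omitted.

```python
import hashlib


def group_into_packs(items, window_s=10, pack_max_items=32):
    """Bucket (t_anchor, payload) items by time window, capping pack size.

    items must be sorted by t_anchor. Returns a list of packs, each a list
    of (t_anchor, payload) tuples.
    """
    packs, current, current_window = [], [], None
    for t_anchor, payload in items:
        window = t_anchor // window_s
        # Start a new pack when the window changes or the pack is full.
        if current and (window != current_window or len(current) >= pack_max_items):
            packs.append(current)
            current = []
        current_window = window
        current.append((t_anchor, payload))
    if current:
        packs.append(current)
    return packs


def fragment_entry(pack_items, pack_bytes):
    """One Track Object entry per pack (field names follow the Option B sketch)."""
    return {
        "t_start": pack_items[0][0],
        "t_end": pack_items[-1][0],
        "pack_byte_size": len(pack_bytes),
        # SHA-256 is a stand-in for the content hash used by the real address.
        "pack_hash": hashlib.sha256(pack_bytes).hexdigest(),
    }


items = [(t, b"img%d" % t) for t in (0, 1, 2, 9, 10, 11, 25)]
packs = group_into_packs(items, window_s=10, pack_max_items=3)
entries = [fragment_entry(p, b"".join(b for _, b in p)) for p in packs]
print([[t for t, _ in p] for p in packs])
```

Each pack produces exactly one PUT and one FragmentEntry, so the Track Object grows with the number of packs rather than the number of items.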

Pack size tuning:

  • Too small (1-2 items): no overhead amortization, same as today.
  • Too large (1000+ items): single-item GETs pay for unrelated payload (waste).
  • Sweet spot: 32-128 items per pack, sized so total pack ≤ 2-5 MB.

At 32 items × 16 KB image = 512 KB per pack → S3 PUT of 512 KB instead of 32 × 16 KB PUTs. Per-PUT overhead amortized 32×.
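One way to turn the tuning guidance into a number is to derive the item count from a byte budget. The sketch below assumes a 4 MiB budget (in line with the target_pack_bytes idea raised under the open questions) and a 128-item cap; `choose_pack_items` and `put_count` are hypothetical helpers, not existing parameters.

```python
import math


def choose_pack_items(item_bytes, target_pack_bytes=4 * 1024 * 1024, max_items=128):
    """Items per pack such that the pack stays within the byte budget."""
    if item_bytes <= 0:
        raise ValueError("item_bytes must be positive")
    fit = target_pack_bytes // item_bytes   # items that fit in the budget
    return max(1, min(fit, max_items))      # at least 1, at most max_items


def put_count(total_items, pack_items):
    """Number of S3 PUTs needed for total_items at a given pack size."""
    return math.ceil(total_items / pack_items)


for size_kb in (16, 45, 1024):
    n = choose_pack_items(size_kb * 1024)
    print(size_kb, "KB items ->", n, "per pack,", put_count(1_000_000, n), "PUTs per 1M")
```

Under these assumptions small items hit the item cap first, while large items are limited by the byte budget, which keeps single-item range reads from being buried in very large objects.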

Read-path changes (dreamdb-dataset/src/dataset/iter.rs, dreamdb-protocol/src/verbs/query_vector.rs)

For iter_stream (sequential read):

  1. Walk the Track Object's per-pack FragmentEntrys in time order.
  2. For each pack, GET the whole pack object once.
  3. Decode the header, iterate items in order, yield each as a record.

For iter_vector (top-K query):

  1. The vector path still uses the embedding modality (Spatial Buckets), unchanged.
  2. When the result set's anchors map to image items, look up the corresponding pack(s) via the Track Object.
  3. Fetch each needed pack (cache-aware), extract the relevant items by their (offset, size) from the header.
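Step 2 of the top-K path amounts to an interval lookup followed by de-duplication, so that a pack holding several hits is fetched once. A minimal sketch, assuming the Track Object's entries are sorted by t_start and do not overlap; `packs_for_anchors` is a hypothetical helper and the dict fields mirror the Option B entry.

```python
from bisect import bisect_right
from collections import defaultdict


def packs_for_anchors(entries, anchors):
    """Map result anchors to the packs that contain them.

    entries: FragmentEntry-like dicts sorted by t_start, non-overlapping.
    Returns {pack_hash: [anchors...]} so each pack is fetched only once.
    """
    starts = [e["t_start"] for e in entries]
    wanted = defaultdict(list)
    for anchor in anchors:
        i = bisect_right(starts, anchor) - 1    # last entry starting at or before anchor
        if i >= 0 and anchor <= entries[i]["t_end"]:
            wanted[entries[i]["pack_hash"]].append(anchor)
    return dict(wanted)


entries = [
    {"t_start": 0, "t_end": 9, "pack_hash": "A"},
    {"t_start": 10, "t_end": 19, "pack_hash": "B"},
    {"t_start": 20, "t_end": 29, "pack_hash": "C"},
]
plan = packs_for_anchors(entries, [3, 27, 5, 12, 40])
print(plan)
```

Anchors that fall outside every pack's time range are dropped here; a real implementation would decide whether that is an error.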

For single-anchor lookups:

  1. GET pack header (or use cached header).
  2. Locate (offset, size) for the target anchor.
  3. GET pack [offset..offset+size] via Range header.
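One detail worth making explicit for step 3: HTTP byte ranges are inclusive at both ends, so the half-open span [offset, offset+size) maps to `bytes=offset-(offset+size-1)`. The sketch below assumes offsets are relative to the end of the header (`payload_base`); pass 0 if they are absolute. `item_range_header` and `apply_range` are hypothetical helpers, and the latter only simulates a ranged GET locally.

```python
def item_range_header(payload_base, offset, size):
    """Build the HTTP Range header for one packed item.

    HTTP byte ranges are inclusive on both ends, so the last byte is
    first + size - 1, not first + size.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    first = payload_base + offset
    last = first + size - 1
    return {"Range": f"bytes={first}-{last}"}


def apply_range(body, header):
    """Local stand-in for a ranged GET against an in-memory object."""
    spec = header["Range"][len("bytes="):]
    first, last = spec.split("-")
    return body[int(first):int(last) + 1]


# A 3-byte fake header followed by two items of 4 and 6 bytes.
pack = b"HDR" + b"aaaa" + b"bbbbbb"
header = item_range_header(payload_base=3, offset=4, size=6)
print(header, apply_range(pack, header))
```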

Backward compatibility

Existing Datasets with per-item FragmentEntrys (no pack) keep working — the read path detects pack vs non-pack via a flag in the FragmentEntry (e.g. pack_item_count > 1 means packed).
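A minimal sketch of that dispatch, assuming legacy entries simply lack the pack_item_count field. `is_packed`, `records_from_entry`, and the stand-in `split_pairs` decoder are hypothetical names used only for illustration.

```python
def is_packed(entry):
    """Legacy entries carry no pack_item_count; treat them as one item."""
    return entry.get("pack_item_count", 1) > 1


def records_from_entry(entry, fetch, decode_pack):
    """Return (t_anchor, payload) records for a packed or legacy entry."""
    body = fetch(entry["address"])
    if is_packed(entry):
        return list(decode_pack(body))
    # Legacy Fragment: the whole object is a single item.
    return [(entry["t_start"], body)]


# Stand-ins for the object store and the pack decoder.
store = {"frag-1": b"one-image", "pack-1": b"aabbcc"}


def split_pairs(body):
    """Fake decoder: treats every 2 bytes as one item."""
    return [(i, body[i:i + 2]) for i in range(0, len(body), 2)]


legacy = {"t_start": 7, "address": "frag-1"}
packed = {"t_start": 20, "address": "pack-1", "pack_item_count": 3}
legacy_records = records_from_entry(legacy, store.__getitem__, split_pairs)
packed_records = records_from_entry(packed, store.__getitem__, split_pairs)
print(legacy_records, packed_records)
```

Under the `> 1` rule a pack holding a single item would be read as a legacy Fragment, so a writer following this sketch would either never emit one-item packs or use an explicit flag instead.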

New writers can opt into packing per modality:

```python
schema.add_image("image", mime="jpeg", pack_items=32)
```

pack_items=None (default) writes one item per Fragment Object (today's behavior). pack_items=32 packs 32 items per Object.

Spec contribution

A new spec/0022-fragment-packs.md defines:

  1. The pack header CBOR format
  2. The Track Object's FragmentEntry extensions for packed/unpacked
  3. The new DreamDBAddress::FragmentPack variant
  4. Read path requirements (sequential decode, per-item byte-range)
  5. Conformance test vectors

Estimated effort

  • Spec doc: 0.5 days
  • Pack header type + encode/decode in dreamdb-protocol: 0.5 days
  • Write path in append.rs: 1 day
  • Read path in iter.rs + query_vector.rs: 1.5 days
  • Python SDK exposure (pack_items schema param): 0.5 days
  • Tests + benchmarks: 1 day
  • Total: 5 days as itemized (the upper end of the original 4-5 day estimate)

Expected payoff (from the 2026-05-22 measurements)

At 32 items per pack, 1M imagenet ingest:

  • PUT count: 31,250 packs instead of 1M Fragments (32× reduction)
  • Per-pack PUT: ~512 KB at ~5 MB/s = 100 ms each (more bandwidth-bound than RTT-bound)
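  • 31,250 / 64 concurrency × 100 ms = ~50 s of upload work, assuming each of the 64 concurrent PUTs sustains ~5 MB/s. If the ~7 MB/s link figure from the Context section caps the aggregate instead, 1M × 16 KB = 16 GB takes ~38 min, which is still shorter than the CLIP encode time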
  • CLIP encoding for 1M: ~87 minutes at 192/s (1,000,000 / 192 ≈ 5,208 s)
  • End-to-end ETA: ~87 min (CLIP-bound, network amortized away)
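The estimate can be reproduced directly from the quoted inputs. The sketch below treats the ~5 MB/s figure as per-connection throughput, which is how the bullet arithmetic reads; all variable names are illustrative.

```python
import math

items = 1_000_000
pack_items = 32
item_kb = 16
per_conn_mb_s = 5.0     # per-PUT throughput assumed in the estimate
concurrency = 64
clip_rate = 192         # encodes per second

packs = math.ceil(items / pack_items)               # PUT count
pack_kb = pack_items * item_kb                      # bytes per PUT, in KB
put_seconds = pack_kb / 1000 / per_conn_mb_s        # time for one PUT
upload_seconds = packs / concurrency * put_seconds  # total upload work
clip_minutes = items / clip_rate / 60               # encode time for the corpus

print(f"{packs} packs of {pack_kb} KB, {put_seconds * 1000:.0f} ms each")
print(f"upload: {upload_seconds:.0f} s, CLIP: {clip_minutes:.1f} min")
```

With these inputs the upload work is on the order of a minute, while CLIP encoding takes close to an hour and a half, so the encoder is the limiting stage.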

Pipelining (already shipped on the evening of 2026-05-22) may reduce the CLIP wall-clock time further by overlapping encoding with upload.

Combined with EC2 us-east-1: ~10-15 minutes for 1M.

When to ship

After the 10K + 100K pipelined baseline runs are stable. The image-pack work is a next slice, not blocking the next 1M test. Ship after a green 100K verifies the pipeline overlap holds at intermediate scale.

Open questions

OQ      Description
OQ-50   Should pack_items be a runtime knob, schema-pinned, or backend-config?
OQ-51   How to handle mid-stream pack_items change — backward-compat policy?
OQ-52   Optimal pack size as function of item size? (Probably target_pack_bytes ≤ 4 MB.)
OQ-53   Multi-pack consolidation analogous to dreamdb-cli compact for packs that became sparse from deletions?

Defer all four to first-implementation pass with empirical data.