DreamDB v0.2.0 bec026

DreamDB Specification — 0022: Fragment Packs

Status: Draft, 2026-05-22. Builds on 0001, 0002 §7.3, 0007 §5–§6, 0014 (item chunking).

1. Purpose

0007 §5 defines the Fragment Object kind for media items (image, audio, video). The historical convention is one Fragment Object per Item, which means a single PUT per item on the write path. This is correct, but when items are small (e.g. 16–64 KB JPEGs typical of a CLIP-encoded image corpus) and the write path has to traverse the WAN to reach object storage, the per-PUT overhead (TCP/TLS handshake + RTT + HTTP framing) dominates the time spent actually transferring payload bytes.

A Fragment Pack is a single Fragment Object that bundles many items together. The Track Object's object_index carries one FragmentEntry per item with a pack_offset field locating the item's bytes within the pack. Readers do byte-range GETs to extract individual items from a pack.

The net effect at write time: N items become ⌈N/pack_items⌉ S3 PUTs instead of N. At read time: nothing changes for sequential scans (they fetch the pack once anyway); random-access reads pay one ranged GET per item, the same RTT as today's one-object-per-item scheme.
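As a quick check of the ⌈N/pack_items⌉ arithmetic, here is a minimal sketch. The function name pack_put_count is hypothetical and is not part of DreamDB.

```rust
/// Number of pack PUTs needed for `n_items` items when each pack holds at
/// most `pack_items` items, i.e. ceil(n_items / pack_items).
/// Hypothetical helper for illustration only.
fn pack_put_count(n_items: u64, pack_items: u64) -> u64 {
    assert!(pack_items >= 1, "pack_items must be at least 1");
    // Integer ceiling division without floating point.
    (n_items + pack_items - 1) / pack_items
}
```

For example, pack_put_count(10_000, 32) evaluates to 313, and pack_put_count(8, 4) to 2.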

This document fixes the on-wire format of FragmentEntry's pack_offset field, the pack body format (such as it is — see §3), and the read/write obligations.

2. Why no pack header

The simplest possible design: the pack body is just the concatenation of item bytes. Per-item indexing lives entirely in the Track Object's FragmentEntry (extended with pack_offset), not in the pack itself.

Rationale:

  • The Track Object is already the source of truth for per-item placement (0005 §5.3.1, Manifest Supremacy). Putting a duplicate index in a pack header would create two sources of truth.
  • Sequential reads stream the pack and iterate its known entries; a header would be pure overhead.
  • Random-access reads (rare for image corpora) pay one ranged GET, same as today's one-object-per-item scheme.
  • No pack-header version field means no pack-header format-evolution problem.

The downside is that a pack is only meaningful in conjunction with its FragmentEntries. A pack Object cannot self-describe its items. This is a deliberate trade: pack Objects without their owning Track are unrecoverable data, but DreamDB's content-addressing already requires the Track to resolve any item, so the constraint adds nothing.

3. On-wire format

3.1 Pack body

A FragmentPack Object's body is raw byte concatenation of item payloads in t_start order:

[item_0 bytes][item_1 bytes][item_2 bytes]...[item_{N-1} bytes]

No prefix. No header. No alignment. Each item begins immediately after the previous item ends.

3.2 Address

Packs are stored under the same address namespace as ordinary Fragment Objects (0007 §5.1): <timeline>/<modality>/<time-bucket>/<hash>. The content hash uniquely identifies the pack; no separate pack path prefix.

A reader cannot distinguish a pack from a single-item Fragment by looking at the Object alone — the distinction lives in the FragmentEntry that references it.

3.3 FragmentEntry extension

Per 0007 §7.3.1, FragmentEntry is a positional CBOR array:

  • 4 elements: [t_start, t_end, byte_size, fragment_address] — historical single-Fragment entry.
  • 5 elements: [t_start, t_end, byte_size, fragment_address, true] — chunked manifest (per 0014).
  • 6 elements: [t_start, t_end, byte_size, fragment_address, false, pack_offset] — packed entry (this spec).

Where:

  • byte_size is the size of THIS item within the pack (not the size of the whole pack).
  • fragment_address is the content hash of the pack Object (multiple entries may share the same value).
  • The 5th element is the existing is_manifest flag, which MUST be false for packed entries (chunking and packing are mutually exclusive).
  • pack_offset is the byte offset of this item's bytes within the pack body. Encoded as a CBOR unsigned integer (≤ u32::MAX).

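A reader retrieves the item's bytes by issuing GET fragment_address [pack_offset .. pack_offset + byte_size]. The range is half-open: the end offset is exclusive, matching the ranges in §3.4. Because HTTP Range headers are inclusive at both ends, the corresponding header is Range: bytes=pack_offset-(pack_offset + byte_size - 1).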
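The following sketch shows one way to turn a packed entry's pack_offset and byte_size into an HTTP Range header value. It assumes the transport is HTTP (as with S3), that byte_size fits in a u64 (an assumption, since this spec does not state its width), and that zero-length items are handled without a GET. The name range_header is hypothetical.

```rust
/// Formats the HTTP `Range` header value for one packed item.
/// HTTP byte ranges are inclusive at both ends, so the last byte requested
/// is `pack_offset + byte_size - 1`.
/// Returns `None` for a zero-length item, because an HTTP byte range cannot
/// express an empty span. Hypothetical helper for illustration only.
fn range_header(pack_offset: u32, byte_size: u64) -> Option<String> {
    if byte_size == 0 {
        return None;
    }
    let first = u64::from(pack_offset);
    // checked_add guards against overflow on a corrupt byte_size.
    let last = first.checked_add(byte_size - 1)?;
    Some(format!("bytes={}-{}", first, last))
}
```

An item at pack_offset 4096 with byte_size 20000 yields "bytes=4096-24095".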

3.4 Pack invariants

For all FragmentEntries that share a fragment_address (i.e., reference the same pack):

  1. The entries' [pack_offset, pack_offset + byte_size) ranges MUST be disjoint and contiguous, starting at offset 0.
  2. The pack's body size MUST equal the sum of the entries' byte_size values (no gaps, no trailing bytes).
  3. Item time ranges MAY overlap (Items at the same anchor across modalities are allowed); the pack stores items in the time order they were emitted.

Writers MUST observe (1) and (2). Readers MAY assume (1) and (2) and MUST emit a clear error if a pack's actual byte length does not match the sum of entries.
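Invariants (1) and (2) can be checked with a single pass over the entries that share a fragment_address. The sketch below is illustrative only: check_pack and PackError are hypothetical names, and entries are modeled as plain (pack_offset, byte_size) pairs rather than decoded FragmentEntry values.

```rust
#[derive(Debug, PartialEq)]
enum PackError {
    /// An entry does not start where the previous one ended (gap or overlap).
    NotContiguous { expected: u64, found: u64 },
    /// The pack body length differs from the sum of the entries' byte_size.
    SizeMismatch { expected: u64, actual: u64 },
}

/// Validates invariants (1) and (2) for all entries referencing one pack.
/// `entries` holds (pack_offset, byte_size) pairs in any order;
/// `pack_len` is the actual byte length of the pack body.
fn check_pack(entries: &[(u64, u64)], pack_len: u64) -> Result<(), PackError> {
    let mut sorted = entries.to_vec();
    sorted.sort_unstable(); // order by pack_offset, then byte_size
    let mut cursor = 0u64; // next expected offset; the first must be 0
    for (offset, size) in sorted {
        if offset != cursor {
            return Err(PackError::NotContiguous { expected: cursor, found: offset });
        }
        cursor += size;
    }
    if cursor != pack_len {
        return Err(PackError::SizeMismatch { expected: cursor, actual: pack_len });
    }
    Ok(())
}
```

A reader that has the full pack body in hand (for example after a sequential scan) could call this once per pack and surface the error variant in its message.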

4. Writer obligations

When Schema.<field>.pack_items = Some(N) and N > 1 and chunk_size = None:

  1. Collect items for this modality from one or more samples in the current append_many call.
  2. Group items into chunks of ≤ N items each. Within each chunk:
    • Compute the pack body = concatenation of item bytes in collection order.
    • PUT the pack as a single Object at <timeline>/<modality>/<time-bucket=0>/<content-hash>.
    • For each item in the chunk, emit a FragmentEntry with pack_offset set to the item's byte offset and byte_size set to the item's length.
  3. Append all emitted FragmentEntry instances to the Track Object's object_index (alongside any pre-existing entries; per 0007 §6.6, multi-entry-per-cell is the LSM steady state).

Writers MAY parallelize pack PUTs across chunks. The recommended concurrency is 64-wide (the same as for individual Fragment PUTs), implemented via buffer_unordered or equivalent.

When pack_items = None, pack_items = Some(1), or chunk_size = Some(_):

  • Writers MUST emit single-item Fragments (the historical path). The 4-element or 5-element FragmentEntry shapes apply.
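The grouping in step 2 of the packed path can be sketched as follows. This assumes items are already in memory as byte vectors in collection order. Pack and build_packs are hypothetical names, and content hashing, the PUT itself, and FragmentEntry encoding are left out.

```rust
use std::convert::TryFrom;

/// One pack body plus the (pack_offset, byte_size) placement of each item.
/// Hypothetical type for illustration only.
#[derive(Debug)]
struct Pack {
    body: Vec<u8>,
    placements: Vec<(u32, usize)>,
}

/// Groups `items` into chunks of at most `pack_items` and concatenates each
/// chunk into one pack body, recording where each item starts.
fn build_packs(items: &[Vec<u8>], pack_items: usize) -> Vec<Pack> {
    assert!(pack_items > 1, "packing applies only when pack_items > 1");
    items
        .chunks(pack_items)
        .map(|chunk| {
            let mut body: Vec<u8> = Vec::new();
            let mut placements = Vec::with_capacity(chunk.len());
            for item in chunk {
                // The offset is the body length before this item is appended,
                // so items are contiguous and the first starts at 0.
                let offset = u32::try_from(body.len())
                    .expect("pack_offset must fit in u32");
                placements.push((offset, item.len()));
                body.extend_from_slice(item);
            }
            Pack { body, placements }
        })
        .collect()
}
```

Each returned Pack corresponds to one PUT, and each placement to one 6-element FragmentEntry. Because the packs are independent of one another, the PUTs can be issued concurrently as described above.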

5. Reader obligations

When reading a FragmentEntry:

  1. If is_manifest = true: follow the ItemManifest path (0014).
  2. Else if the entry has 6 positional elements with the 6th being a CBOR unsigned integer: this is a packed entry. Fetch the item by GET fragment_address [pack_offset .. pack_offset + byte_size].
  3. Else: this is a single-Fragment entry. Fetch the item by GET fragment_address (whole-Object).
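The three-way dispatch above can be sketched over an already-decoded entry. Value stands in for whatever a CBOR decoder returns, and FetchPlan and plan_fetch are likewise hypothetical names. The sketch assumes the entry's arity and field types were validated during decoding.

```rust
/// Minimal stand-in for decoded CBOR values (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Uint(u64),
    Bool(bool),
    Bytes(Vec<u8>),
}

/// How the reader should fetch the item described by one FragmentEntry.
#[derive(Debug, PartialEq)]
enum FetchPlan {
    Manifest,
    PackedRange { offset: u64, len: u64 },
    WholeObject,
}

/// Applies the reader rules in order: manifest flag first, then the
/// 6-element packed shape, otherwise a whole-Object fetch.
/// Positions: [t_start, t_end, byte_size, fragment_address, is_manifest, pack_offset].
fn plan_fetch(entry: &[Value]) -> FetchPlan {
    if let Some(Value::Bool(true)) = entry.get(4) {
        return FetchPlan::Manifest;
    }
    if entry.len() == 6 {
        if let (Value::Uint(len), Value::Uint(offset)) = (&entry[2], &entry[5]) {
            return FetchPlan::PackedRange { offset: *offset, len: *len };
        }
    }
    FetchPlan::WholeObject
}
```

A PackedRange result maps to a ranged GET covering len bytes starting at offset; WholeObject maps to a plain GET of fragment_address.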

Readers MAY cache pack bodies after a full GET to amortize subsequent ranged reads from the same pack within the same scan. Implementations SHOULD do this when the read iterator is sequential and item entries are consecutive (a common pattern for training-data streaming).
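One simple way to realize this caching is a single-slot cache that keeps the most recently fetched pack body and slices items out of it. The sketch below is an assumption about how a reader might do this, not a description of the DreamDB implementation. PackCache and read_item are hypothetical names, and fetch_pack stands in for the full-Object GET.

```rust
/// Single-slot cache holding the body of the most recently fetched pack.
/// Hypothetical type for illustration only.
struct PackCache {
    address: Option<Vec<u8>>,
    body: Vec<u8>,
}

impl PackCache {
    fn new() -> Self {
        PackCache { address: None, body: Vec::new() }
    }

    /// Returns the item at [offset, offset + size) within the pack at
    /// `address`, calling `fetch_pack` only if that pack is not cached.
    /// Returns `None` if the range falls outside the pack body.
    fn read_item<F>(
        &mut self,
        address: &[u8],
        offset: usize,
        size: usize,
        fetch_pack: F,
    ) -> Option<Vec<u8>>
    where
        F: FnOnce(&[u8]) -> Vec<u8>,
    {
        if self.address.as_deref() != Some(address) {
            // Cache miss: one full GET replaces the cached body.
            self.body = fetch_pack(address);
            self.address = Some(address.to_vec());
        }
        let end = offset.checked_add(size)?;
        self.body.get(offset..end).map(|bytes| bytes.to_vec())
    }
}
```

With consecutive entries from the same pack, only the first read_item call triggers a fetch. A single slot is enough for strictly sequential scans, while interleaved access to several packs would need a larger cache.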

6. Compaction interaction

Packs are immutable like all DreamDB Objects. The compaction operation defined in 0021 (which consolidates multi-bucket-per-cell SpatialBucket Tracks back to F=1) is not directly applicable to Fragment Tracks today — Fragment Tracks accumulate FragmentEntry rows per append_many, not per-cell fragments.

Future work: a "pack compaction" operation that re-packs items spread across many small packs into fewer, larger packs. Deferred until empirical evidence of small-pack sprawl. Not addressed in this spec.

7. Why this design (vs alternatives)

Alternative | Why rejected
------------|-------------
Pack header CBOR with per-item index inside the pack | Duplicates the Track Object's index; two sources of truth; format-evolution problem.
Separate pack/ path prefix at the address layer | Adds visible path complexity; consumer needs to know two paths exist. Content-addressing already disambiguates.
New InlineObjectIndex::FragmentPack variant in Track Object | Adds a new variant to 129+ match sites across the codebase. Not justified for what is a small per-entry shape change.
Pack with internal byte-range index AND Track-level pack_offset | Strictly more bytes for the same information.
Chunking (0014) for packing | Chunking splits ONE big item into N PUTs. Packing bundles N small items into ONE PUT. Opposite directions; can't compose.

8. Conformance test vectors (deferred to 0009)

The conformance suite at 0009 §12 (added in this revision) carries:

  1. Round-trip a pack: write N=8 items with pack_items=4 → expect 2 packs in S3 → read back, byte-equality vs originals.
  2. Mixed packs in one Track: append in 3 batches with different N each batch → expect 3 separate packs, item count matches.
  3. Mixed packed and unpacked Fields in one Schema: confirm per-Field independence of the pack_items knob.
  4. Reject pack + chunk combination: ingest a sample with both chunk_size and pack_items > 1 set on the same field → writer MUST refuse, OR writer MUST silently degrade to chunking (current behavior).
  5. Byte-range fetch: a single-item GET with pack_offset set returns exactly the bytes in the half-open range [pack_offset, pack_offset + byte_size) of the pack.

9. Measured impact

10K imagenet S3 ingest, laptop → us-east-1, fair A/B (back-to-back warm runs, 2026-05-22):

Mode | S3 image objects | Wall-clock | Rate
-----|------------------|------------|-----
Unpacked (1 PUT/item) | 10,000 | 72.0 s | 138.9 /s
Packed (pack_items=32) | 313 | 66.1 s | 151.3 /s

Packing reduces the S3 PUT count by ~32× (10,000 → 313). Wall-clock improves by only ~8%, because CLIP-MPS encoding at ~150 /s is the dominant cost on this hardware. On GPU hardware where CLIP encoding is no longer the floor, the network savings should translate to a much larger wall-clock win (estimated 5–10× per design/0008 §"Expected payoff").

The structural change is required for billion-scale ingest, where individual-item PUTs run into S3 request-rate limits and per-request dollar cost. With packing, the number of PUT requests scales as O(N / pack_items) rather than O(N); the number of bytes transferred is unchanged.

10. Open questions

OQ | Description
---|------------
OQ-47 | Should pack_items be schema-pinned or per-append_many call? Currently schema-pinned (simpler).
OQ-48 | Optimal pack size as a function of item size? Heuristic: aim for target_pack_bytes ≤ 2 MB. Not enforced; operator picks.
OQ-49 | Pack-level compaction (merging many small packs into fewer larger ones): deferred until empirical small-pack sprawl is observed.

These questions are deferred until empirical data from the first implementation is available.