DreamDBv0.2.0bec026

DreamDB Specification — 0007: Streaming Encapsulation & Object Byte Layouts

Status: Draft. Builds on 0000 through 0006. This document fixes the byte format inside every Object kind: Fragments (media), Spatial Buckets (vectors), Time-bucketed batches (events), Index Pages, and the constant-byte vector storage layout. Resolves OQ-4, OQ-13, OQ-14, OQ-15, OQ-20, OQ-21, OQ-23, OQ-24.


1. Purpose

Specs 0001 through 0006 defined what DreamDB Objects exist, how they're addressed, and how SDKs operate on them. This document defines what the bytes inside each Object look like.

By the end of this document, the following are concrete:

  • For each Object kind: the byte layout, header format, and intra-Object addressing scheme.
  • The Fragment format for media tracks (resolves OQ-4).
  • The Spatial Bucket format for vector tracks, including the byte-range-reference pattern that makes multi-table affordable (resolves OQ-23).
  • The Time-bucketed batch format for high-volume event tracks.
  • The Index Page format with delta-encoding for scale (resolves OQ-20, OQ-21).
  • The Bucket-splitting rule for hot spatial keys (resolves OQ-24).
  • The default Index Page fanout and target page size (resolves OQ-14).
  • The inline-vs-paged threshold (resolves OQ-15).
  • The canonical segment order for spatiotemporally-partitioned spatial keys (resolves OQ-13).

What this document does not define:

  • The DreamDB Protocol verb logic — 0006.
  • Manifest history walking and merge semantics — 0008.
  • Conformance test vectors — 0009.

2. Object Kinds, Recap

Object kind         | Defined in           | Internal byte format                | Where its address lives
--------------------+----------------------+-------------------------------------+------------------------------------------------------------
Genesis             | 0001 §5              | Deterministic CBOR (§3)             | genesis/<hash>
Manifest            | 0002 §7.2            | Deterministic CBOR (§3)             | manifests/<hash>
Track Object        | 0002 §7.3            | Deterministic CBOR (§3)             | <timeline>/<modality>/track/<hash>
Index Page          | 0002 §7.2.2 / §7.3.2 | Det. CBOR + tight inner format (§7) | manifests/index/<hash> or <timeline>/<modality>/index/<hash>
SpatialIndex        | 0004 §3              | Deterministic CBOR (§3)             | spatial-index/<hash>
Constant            | 0001 §4.3            | Modality-defined (§4)               | <timeline>/<modality>/<hash>
Fragment            | 0001 §4.1            | Streaming media container (§5)      | <timeline>/<modality>/<time-bucket>/<hash>
Spatial Bucket      | 0001 §4.1            | DreamDB Bucket Format (§6)          | <timeline>/<modality>/<spatial-key>/[<time-bucket>/]<hash>
Time-bucketed batch | 0001 §4.2            | DreamDB Event Batch Format (§8)     | <timeline>/<modality>/<time-bucket>/<hash>

Genesis, Manifest, Track Object, and SpatialIndex are pure CBOR — covered by 0002 §3. This document fixes the four "data plane" formats: Constant (§4), Fragment (§5), Spatial Bucket (§6), and Time-bucketed batch (§8) — plus the Index Page byte layout (§7).

3. Universal Conventions

Across all data-plane Object formats:

  • Multibyte integers: little-endian, fixed width. Spec call-outs specify size (u8, u16, u32, u64).
  • No padding: fields are tightly packed; no alignment bytes.
  • Magic-byte fields: where Object headers carry a "magic" identifier (Bucket Object §6.1, VectorStorage §6.3, Time-batch §8.1, GraphPage 0013 §4.2, EncryptionMeta 0019 §4.1), the field is a 4-byte ASCII byte sequence written in document order, NOT a little-endian u32. A magic field documented as "VBUU" means the four bytes 0x56 0x42 0x55 0x55 appear on disk in that order. Implementers comparing magic bytes MUST do a byte-string compare (memcmp("VBUU", buf, 4)) and MUST NOT load the bytes as a u32 with 0x56425555 — that would be host-endian dependent and break on big-endian hosts. The hex form sometimes given in spec text is the big-endian interpretation of the byte sequence for human-readability; on disk the byte sequence is byte-order-independent.
  • All hashes: 33-byte multihash (1 byte algorithm tag + 32 bytes BLAKE3, per 0002 §2.1). Bare in headers (no length prefix needed; size is constant).
  • All time anchors: 8-byte u64 ns ticks since Genesis (per 0003). Bare; integer-only arithmetic per 0003 §4.1.
  • Vector floats: little-endian IEEE 754 binary32 (per 0004 §5.4). f32-determinism discipline applies.
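
The magic-byte and integer conventions above can be sketched as follows. This is a minimal illustration in Python, not SDK code; the helper names check_magic and read_u32_le are hypothetical.

```python
import struct

def check_magic(buf: bytes, expected: bytes) -> bool:
    """Compare the first four bytes as a byte string, never as a u32."""
    if len(expected) != 4:
        raise ValueError("magic must be exactly 4 bytes")
    return buf[:4] == expected

def read_u32_le(buf: bytes, offset: int) -> int:
    """Read a fixed-width little-endian u32 at the given byte offset."""
    return struct.unpack_from("<I", buf, offset)[0]

# Illustrative 8-byte prefix: magic "VBUU" followed by a u32 version of 1.
prefix = b"VBUU" + struct.pack("<I", 1)
assert check_magic(prefix, b"VBUU")
assert prefix[:4] == bytes([0x56, 0x42, 0x55, 0x55])
assert read_u32_le(prefix, 4) == 1
```

The comparison works on the raw byte sequence, so the result is the same on little-endian and big-endian hosts.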

4. Constant Object

The simplest format. A Constant Object is the constant's payload bytes — no header, no framing, no metadata. Modality-determined interpretation:

  • title.text, author.text, description.text, license.spdx: UTF-8 byte string. No BOM, no terminator.
  • author.json, annotation.json (when used as constant): UTF-8 JSON.
  • source.uri: ASCII URI string.

Interpretation of a Constant Object's bytes is the application's concern; DreamDB stores the bytes opaquely.

Size limit: 1 MiB per Constant Object. Larger constants are an antipattern (use a Fragment or external storage). Writers MUST refuse to write a Constant > 1 MiB; readers MUST reject Constant Objects > 1 MiB as malformed.

5. Fragment Format (Media — resolves OQ-4)

Media modalities (video.*, audio.*) use fragmented MP4 / CMAF as the container format for DreamDB Fragments. This is the same containerization used by HLS/DASH/CMAF and natively decoded by every modern media decoder (browsers, FFmpeg, hardware decoders, gst).

5.1 Why CMAF / fragmented MP4

  • Industry-proven: every CDN-fronted streaming service uses fragmented MP4 or CMAF. Decoders are battle-tested.
  • Self-contained per Fragment: each Fragment includes its own moof (movie fragment) box; can be decoded standalone (after the Track's moov initialization segment is known).
  • Byte-rangeable internally: the per-frame byte offsets within a Fragment are derivable from the Fragment header without parsing the rest. Aligns with DreamDB's bytes:<start>-<end> URI form.
  • GOP-aligned natively: codecs produce GOPs (Groups of Pictures) at I-frame boundaries; CMAF Fragments align to those boundaries by convention.

5.2 Per-Track initialization segment

A media Track has an initialization segment — a small CBOR-wrapped MP4 moov that describes the codec, sample rate, dimensions, etc. The init segment is referenced by the Track Object as a separate Object:

TrackObject {
   ...
   "init_segment": <multihash-of-init-segment-Object>,
   "object_index": ...,    -- Fragment index per 0002 §7.3
}

Init segment Object lives at:

<timeline>/<modality>/init/<hash>

(The literal init path segment distinguishes this address from the track, index, and time-keyed Fragment paths.)

Decoders that have fetched the init segment and a Fragment can decode the Fragment immediately — same flow as MSE (Media Source Extensions) in browsers.

5.3 Fragment byte layout

A DreamDB Fragment Object is a single CMAF Fragment:

[moof box] [mdat box]
  • moof (movie fragment): contains track-fragment headers (tfhd) and track runs (trun); the trun sample sizes and data offset give each sample's byte position within the Fragment.
  • mdat (media data): the encoded samples (frames) themselves.

Per-frame byte ranges are derived from moof's sample table — readable in ~1 KB at the start of the Fragment. DreamDB's Stream verb (0006 §4.4) consumes whole Fragments (or byte ranges within a Fragment), feeding them to the decoder.

5.4 Fragment duration (resolves part of OQ-10)

Default Fragment duration: 2 seconds (low-latency seek; 60 frames @ 30 fps; balanced object count).

Modality MAY override via parameter: video.h264.frag-duration=6s for archival / cheap-storage mode (~3× fewer Objects per hour, longer first-byte latency, larger Index entries).

Permitted range: 1 s ≤ frag-duration ≤ 30 s. Outside this range, encoding properties degrade (sub-1 s: Fragment overhead exceeds payload; > 30 s: GOP boundaries become rare in encoders that prefer 2–10 s GOPs).

5.5 Bucket-duration alignment

Per 0002 §6.3.1, Fragments live at <timeline>/<modality>/<time-bucket>/<hash> where <time-bucket> = floor(t_start / bucket-duration). The bucket-duration parameter for media modalities defaults to 60 s (modality parameter bucket=60s); operators MAY tune.

Default placement: 30 Fragments per time-bucket (60 s bucket / 2 s Fragment). This is the "10–100× ratio" target from project_billion_scale_time.md — bucket aggregation is meaningful, address-segment overhead is amortized.
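
The bucket arithmetic above can be sketched as follows, assuming time anchors are u64 nanosecond ticks as in §3. The function names are illustrative, and how the resulting bucket number is rendered into the <time-bucket> path segment is left to 0002 §6.3.1.

```python
NS_PER_SECOND = 1_000_000_000

def time_bucket(t_start_ns: int, bucket_duration_ns: int = 60 * NS_PER_SECOND) -> int:
    """floor(t_start / bucket-duration), using integer arithmetic only."""
    if t_start_ns < 0 or bucket_duration_ns <= 0:
        raise ValueError("t_start must be >= 0 and bucket duration must be > 0")
    return t_start_ns // bucket_duration_ns

def fragments_per_bucket(bucket_duration_ns: int, frag_duration_ns: int) -> int:
    """Number of whole Fragments that fit in one time-bucket."""
    return bucket_duration_ns // frag_duration_ns

# Defaults from this section: 60 s buckets, 2 s Fragments.
assert time_bucket(125 * NS_PER_SECOND) == 2
assert fragments_per_bucket(60 * NS_PER_SECOND, 2 * NS_PER_SECOND) == 30
```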

6. Spatial Bucket Format (resolves OQ-23)

Spatial Bucket Objects hold a packed array of vectors plus a byte-rangeable index. Two storage modes — choose at modality definition time:

6.1 Inline-vector mode (default for small/medium tracks)

Inline mode requires the modality to declare a fixed record_size. Every record is exactly the same size, so byte_offset(idx) = header_size + idx × record_size is well-defined without any in-Object lookup table.

Modalities with variable-size vector encodings (sparse vectors, codebook-dependent PQ encodings, multi-dimensionality bundles) MUST use reference mode (§6.2) — Vector-Storage Objects can carry per-record byte ranges in their own offset table. Inline mode is reserved for fixed-size record modalities.

A modality that declares inline mode without a fixed record_size (e.g. by leaving dim unparameterized) is malformed; readers MUST reject Tracks of such modalities.

Vectors are stored inline, byte-addressable. Layout:

┌──────────────────────────────────────────────────────────┐
│ Bucket Header (160 bytes, fixed)                         │
│   magic:             4 bytes  = 0x56425555 ("VBUU" ASCII)│
│   version:           u32      = 1                         │
│   record_size:       u32      = bytes per vector entry    │
│   record_count:      u32      = N vectors in this bucket  │
│   header_size:       u32      = 160 (this header's size)  │
│   spatial_index_hash: 33 bytes (multihash of the          │
│                       SpatialIndex Object that produced   │
│                       this bucket — per 0004 §3)          │
│   modality:          32-byte ASCII (zero-padded modality  │
│                       tag prefix; for verification)       │
│   reserved:          remaining bytes set to 0             │
├──────────────────────────────────────────────────────────┤
│ Record 0  (record_size bytes)                            │
│   time_anchor:  u64                                       │
│   vector:       record_size − 8 bytes (modality-defined)  │
├──────────────────────────────────────────────────────────┤
│ Record 1                                                  │
│ ...                                                       │
│ Record N-1                                                │
└──────────────────────────────────────────────────────────┘
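
A minimal sketch of packing and parsing the 160-byte header follows. It assumes the fields appear on disk in the order listed in the diagram, and that the reserved region is the remaining 75 bytes (160 minus the 85 bytes of named fields). The names BucketHeader, pack_bucket_header, and parse_bucket_header are hypothetical.

```python
import struct
from typing import NamedTuple

MAGIC = b"VBUU"
# "<" selects little-endian with no alignment padding:
# 4 + 4*4 + 33 + 32 + 75 = 160 bytes.
BUCKET_HEADER = struct.Struct("<4sIIII33s32s75s")

class BucketHeader(NamedTuple):
    version: int
    record_size: int
    record_count: int
    header_size: int
    spatial_index_hash: bytes
    modality: str

def pack_bucket_header(record_size: int, record_count: int,
                       spatial_index_hash: bytes, modality: str) -> bytes:
    if len(spatial_index_hash) != 33:
        raise ValueError("spatial_index_hash must be a 33-byte multihash")
    tag = modality.encode("ascii")[:32]  # tag prefix; struct zero-pads to 32
    return BUCKET_HEADER.pack(MAGIC, 1, record_size, record_count, 160,
                              spatial_index_hash, tag, bytes(75))

def parse_bucket_header(buf: bytes) -> BucketHeader:
    if len(buf) < BUCKET_HEADER.size:
        raise ValueError("buffer shorter than 160-byte header")
    (magic, version, record_size, record_count, header_size,
     si_hash, tag, _reserved) = BUCKET_HEADER.unpack_from(buf)
    if magic != MAGIC:  # byte-string compare, per §3
        raise ValueError("bad magic")
    if header_size != 160:
        raise ValueError("unexpected header_size for the 160-byte layout")
    return BucketHeader(version, record_size, record_count, header_size,
                        si_hash, tag.rstrip(b"\x00").decode("ascii"))
```

A round trip such as parse_bucket_header(pack_bucket_header(3080, 2, b"\x1e" + bytes(32), "embedding.f32.dim=768.bucketed")) returns the same field values; the hash bytes used here are placeholders.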

For embedding.f32.dim=768.bucketed:

  • record_size = 8 (time) + 768 × 4 (vector) = 3080 bytes.
  • A vector at index i: byte_offset(i) = 160 + i × 3080.

The byte range for record i is computable from the modality parameters alone — no header fetch needed, no in-Object offset table. The SDK at write time emits URIs of the form:

<bucket-address>#bytes:<160 + i*3080>-<160 + (i+1)*3080>

per 0002 §6.5 (URIs are universal byte-range references).
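
The offset arithmetic can be sketched as below. The function names are illustrative. The end value mirrors the URI template above (the start of the next record); whether the range end is treated as inclusive or exclusive is governed by 0002 §6.5 and is not decided here.

```python
HEADER_SIZE = 160  # inline-mode Bucket header without compress= or .multi-vec

def record_size_f32(dim: int) -> int:
    """8-byte time_anchor plus dim little-endian f32 components."""
    return 8 + 4 * dim

def record_byte_range(i: int, record_size: int, header_size: int = HEADER_SIZE):
    """Return (start, end) for record i, following the template in the text."""
    start = header_size + i * record_size
    return start, start + record_size

def record_uri(bucket_address: str, i: int, record_size: int) -> str:
    start, end = record_byte_range(i, record_size)
    return f"{bucket_address}#bytes:{start}-{end}"

assert record_size_f32(768) == 3080
assert record_byte_range(0, 3080) == (160, 3240)
assert record_uri("bucket", 1, 3080) == "bucket#bytes:3240-6320"
```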

6.1.1 Spatial-index lineage validation (mandatory)

The spatial_index_hash field is mandatory and serves as defense-in-depth for the SpatialIndex-vs-Bucket consistency invariant. The SDK MUST perform this check at decode time:

  1. When a Bucket is fetched as part of resolving a query, compare its header's spatial_index_hash against the SpatialIndex Object hash the SDK is currently using for the query (loaded from the Manifest's registry).
  2. Direct match → silent success — proceed with normal vector decoding.
  3. Mismatch → walk the lineage chain (per 0004 §3.5): fetch the current SI Object, recursively follow its parents field (depth-first, first-parent), check whether the Bucket's spatial_index_hash appears anywhere in the chain.
  4. Found in chain → silent success. The Bucket was placed under an ancestor SI; readers MUST treat it as valid against the current SI, but only for queries that resolve to cells whose centroid position is preserved between the ancestor and the current SI. This per-cell preservation check is performed by the SDK's dispatcher using the current SI: if IvfCosine::hash_vector(query) → cell_id_current, and the ancestor's centroid at cell_id_current is byte-identical to the current SI's centroid at cell_id_current, the Bucket is in scope.
  5. Not found in chain (or chain truncated at the 100-ancestor cap) → critical error. The SDK MUST abort the decode, log the discrepancy, evict the Bucket from cache, and treat the read as backend corruption (surface to caller).
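
Steps 1 through 5 can be sketched as follows. This is a simplified illustration under stated assumptions: parents_of is a hypothetical in-memory map standing in for fetching SI Objects, the walk follows only the first parent at each step (one reading of "depth-first, first-parent"), and the per-cell centroid preservation check from step 4 is left to the caller.

```python
from typing import Dict, List

MAX_ANCESTORS = 100  # chain cap named in step 5

class LineageError(Exception):
    """The Bucket's spatial_index_hash is not in the current SI's lineage."""

def validate_lineage(bucket_si_hash: bytes, current_si_hash: bytes,
                     parents_of: Dict[bytes, List[bytes]]) -> str:
    """Return "direct" or "ancestor"; raise LineageError otherwise."""
    if bucket_si_hash == current_si_hash:
        return "direct"
    node = current_si_hash
    for _ in range(MAX_ANCESTORS):
        parents = parents_of.get(node, [])
        if not parents:
            break  # reached the root without finding the hash
        node = parents[0]  # first-parent step
        if node == bucket_si_hash:
            # Caller must still run the per-cell preservation check (step 4).
            return "ancestor"
    raise LineageError("spatial_index_hash not found in lineage chain")
```

On LineageError the caller would abort the decode, evict the Bucket from cache, and surface the error, as step 5 requires.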

This catches an entire class of subtle bugs that are otherwise invisible until query results go silently wrong:

  • Cache mis-keying (the most common SDK bug — keying caches by (modality, spatial_key) instead of by content hash, causing stale-SpatialIndex Buckets to be served against new queries).
  • Multi-Manifest sessions that cache Buckets across Manifest boundaries where the SpatialIndex changed.
  • Forensic / debugging scenarios that need to verify which SpatialIndex generation produced a Bucket without re-fetching the Manifest.
  • Cross-implementation interop tests that need to assert "this Bucket was built with seed X."

The chain-walk mode (added in spec revision 2026-05) lets dreamdb-cli ada-ivf-step reuse Buckets for cells whose centroid was preserved across a rebuild — avoiding the O(N) re-dispatch that strict equality would force. Buckets for cells whose centroid CHANGED still require rewriting (the SDK's per-cell preservation check sees the centroid difference and refuses the chain walk).

Cost of the field at billion-scale: ~33 bytes × ~256K Buckets = ~8 MiB. Rounding error. Chain-walk cost: O(chain-depth) HTTP GETs per first-touch of an old Bucket, cached for the session.

The same field appears in the reference-mode Bucket header (§6.2). Vector-Storage Objects (§6.3) hold raw vectors not bound to any specific SpatialIndex generation, so they do NOT carry this field — they are addressed only via Bucket reference, and the Bucket's lineage check covers the path.

6.1.2 VectorCompressor lineage (per 0010 §8.2)

When a modality declares compress=<algo>:<param> (per 0010 §4), the Bucket header gains a parallel vector_compressor_hash field (33 bytes) immediately after spatial_index_hash. The fixed Bucket header grows to 200 bytes in that case (193 of payload + 7 bytes of zero-pad for 8-byte alignment).

For Buckets emitted by modalities WITHOUT compress=: the header remains the 160-byte v0 layout from §6.1 above. There is no "all-zeros compressor field" in the 160-byte form — the absence of the modality parameter is itself the signal. Decoders MUST consult the modality to determine which header size to read.

Lineage validation extends symmetrically: a Bucket carrying vector_compressor_hash = H MUST be read with the VectorCompressor Object whose hash matches H; mismatch is the same class of critical error as §6.1.1 (silent under-recall when wrong codebook is used).

6.1.3 MultiVectorIndex lineage (per 0015 §4.2)

ColBERT-style modalities (.multi-vec parameter) require a per-record doc_id: u64 alongside the existing time_anchor: u64, plus a header field multi_vec_index_hash: 33 bytes identifying the MultiVectorIndex Object that produced the records.

The Bucket header in that case grows to 240 bytes (200 of §6.1.2 payload + 33 multi-vec hash + 7 bytes of zero-pad). Records gain 8 bytes: record_size = 8 (time) + 8 (doc_id) + code_bytes.

Decoders dispatch on the modality string: presence of .multi-vec ⇒ 240-byte header; presence of compress= only ⇒ 200-byte header; neither ⇒ 160-byte header. The three header sizes are mutually exclusive at the modality level: a Bucket header MUST match the modality's declared shape, and a mismatch is a critical error.
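
A minimal sketch of that dispatch follows. It assumes plain substring matching on the modality string is sufficient, which may not hold for the full modality grammar defined elsewhere; the example modality strings, including the compress=pq:16 value, are hypothetical.

```python
def bucket_header_size(modality: str) -> int:
    """Return the fixed Bucket header size implied by the modality string."""
    if ".multi-vec" in modality:
        return 240  # 200 + 33-byte multi_vec_index_hash + 7 bytes zero-pad
    if "compress=" in modality:
        return 200  # 160 + 33-byte vector_compressor_hash + 7 bytes zero-pad
    return 160

assert bucket_header_size("embedding.f32.dim=768.bucketed") == 160
assert bucket_header_size("embedding.f32.dim=768.compress=pq:16") == 200
assert bucket_header_size("embedding.f32.dim=128.compress=pq:16.multi-vec") == 240
```

The .multi-vec check comes first because, per the rule above, compress= selects the 200-byte header only when .multi-vec is absent.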

Pre-Phase-4 readers MUST reject Buckets of .multi-vec modalities. A reader that does not implement spec/0015 (i.e., does not understand the multi_vec_index_hash lineage field) MUST treat any Track with a .multi-vec modality as unsupported and refuse to query it. Silently reading the 200-byte (or 160-byte) prefix of a 240-byte header produces correct-looking bytes with garbled doc_id interpretation — a silent under-recall failure mode. This rejection is the protocol-level analog of spec/0010's "compressor lineage mismatch ⇒ critical error" discipline.

Lineage validation is the same pattern as §6.1.1 / §6.1.2: the SDK MUST verify multi_vec_index_hash matches the MultiVectorIndex Object in the Manifest registry before decoding.

6.2 Reference-vector mode (for multi-table at billion-scale)

When a modality has tables=L > 1, inline-vector mode would multiply storage by L (each vector appears in L buckets). Reference-vector mode decouples vector bytes from bucket membership:

  • Vector bytes are stored in per-Track Vector-Storage Objects (one or a few per Track, large packed arrays).
  • Bucket Objects hold byte-range references into the Vector-Storage Objects, not the vectors themselves.

Layout of a reference-mode Bucket Object:

┌──────────────────────────────────────────────────────────┐
│ Bucket Header (160 bytes, fixed)                         │
│   (same fields as §6.1, with mode flag = "ref"            │
│    and spatial_index_hash mandatory per §6.1.1)          │
├──────────────────────────────────────────────────────────┤
│ Reference 0  (49 bytes)                                  │
│   time_anchor:  u64                                       │
│   vec_obj_hash: 33 bytes (multihash of Vector-Storage)   │
│   byte_offset:  u64       (offset within Vector-Storage)  │
├──────────────────────────────────────────────────────────┤
│ Reference 1                                              │
│ ...                                                       │
└──────────────────────────────────────────────────────────┘

Storage savings at INGEST TIME (un-compacted):

  • Inline-vector, L=4 tables, 1B vectors at 3 KB: 12 TB total (4× duplicated).
  • Reference-vector, L=4 tables, 1B vectors at 3 KB: 3 TB vectors (single-copy) + ~200 GB references = ~3.2 TB. ~75% ingest-time savings.

Important caveat, 2026-05 (corrects a misleading framing in the 2026-04 draft): the 75% savings figure describes un-compacted reference-mode storage. Reference mode at L>1 requires per-cell compaction (per 0006 §7.3) to make queries fast; without it, every cell's records are scattered across the shared VS Object and queries do thousands of random reads. Compaction rewrites each (table, cell) pair into its own per-cell VS Object, so post-compact storage equals L× the raw vector bytes, eliminating the dedup advantage. Empirically: dreamdb-bench 20M dim=192 tables=4 measured 19.4 GB ingest-time storage growing to 84.8 GB post-compact (4.4× growth, of which 4× is the irreducible per-table replication).
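
The storage figures above can be reproduced with a short calculation. This sketch takes "3 KB" as 3000 bytes and ignores header and index overhead; the function names are illustrative.

```python
REFERENCE_BYTES = 49  # time_anchor (8) + vec_obj_hash (33) + byte_offset (8)

def inline_bytes(n_vectors: int, vector_bytes: int, tables: int) -> int:
    """Inline mode: every vector is duplicated once per table."""
    return n_vectors * vector_bytes * tables

def reference_bytes(n_vectors: int, vector_bytes: int, tables: int) -> int:
    """Un-compacted reference mode: one vector copy plus one reference per table."""
    return n_vectors * vector_bytes + n_vectors * tables * REFERENCE_BYTES

def post_compact_bytes(n_vectors: int, vector_bytes: int, tables: int) -> int:
    """Per-cell compaction re-materializes the vectors once per table."""
    return n_vectors * vector_bytes * tables

n, vb, L = 10**9, 3000, 4
inline = inline_bytes(n, vb, L)        # 12 TB
ref = reference_bytes(n, vb, L)        # 3 TB vectors + 196 GB references
saving = 1 - ref / inline              # about 0.73
assert post_compact_bytes(n, vb, L) == inline
```

Under these assumptions the ingest-time saving comes out near 73%, which the text rounds to ~75%, and the post-compact total matches inline mode.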

Mode selection guidance:

Workload                                       | Recommended mode                      | Why
-----------------------------------------------+---------------------------------------+----------------------------------------------
tables=1, latency-sensitive ANN                | Reference mode + per-cycle compaction | 1× post-compact storage; fast queries; no per-table replication
tables=L>1, latency-sensitive ANN (production) | Inline mode                           | Same post-compact L× storage as reference, but no compaction step → no writer-vs-compactor coordination cost (per 0008 §7.4)
tables=L>1, archival / rare queries            | Reference mode, no compaction         | Keep 1× ingest-time storage; pay slow first-query latency when needed
Streaming time-range queries                   | Reference mode                        | The time-ordered VS layout matches the access pattern

Anti-pattern: tables=L>1 reference mode + compaction. Empirically (see /reports/realistic-distributed.log), this combination pays the L× post-compact storage cost AND the compaction wall-time cost AND the writer-vs-compactor coordination cost — for no advantage over inline mode. The 2026-04 draft positioned this as the production default; it should not be.

Stronger recommendation: use tables=1 (per 0004 §6.2.1) with max_hamming=2/probe_count=16-32 read-time multi-probe (per 0004 §6.5) before reaching for L>1. Empirical evidence: tables=1, max_hamming=2, probe_count=16 strictly Pareto-dominates tables=4, max_hamming=1, probe_count=4 on recall, latency, AND storage at the same per-query budget.

6.3 Vector-Storage Object format (reference mode only)

A Vector-Storage Object is a flat packed array of (time_anchor, vector_bytes) records, addressed at:

<timeline>/<modality>/vectors/<hash>

Layout:

┌──────────────────────────────────────────────────────────┐
│ VectorStorage Header (128 bytes, fixed)                  │
│   magic:         4 bytes  = 0x56535455 ("VSTU")           │
│   version:       u32      = 1                              │
│   record_size:   u32                                       │
│   record_count:  u32                                       │
│   modality:      32-byte ASCII (zero-padded modality tag  │
│                   prefix; for verification — same shape   │
│                   as Bucket Header §6.1)                  │
│   reserved:      remaining bytes set to 0                  │
│                   (16 + 32 + reserved = 128)               │
├──────────────────────────────────────────────────────────┤
│ Record 0 ... Record N-1 (each record_size bytes)         │
└──────────────────────────────────────────────────────────┘

A Bucket reference resolves a record in the target VS Object via byte_offset = 128 + record_index × record_size (VS Object's 128-byte header + packed records). Inline-mode Bucket arithmetic is parallel but uses the Bucket's larger 160-byte header (160 + idx × record_size); the two header sizes differ because Bucket Objects carry the spatial_index_hash lineage field while VS Objects do not.
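
A sketch of decoding one 49-byte reference and computing a record's offset in the VS Object follows, assuming the reference fields are packed in the order shown in §6.2. The names REFERENCE, unpack_reference, and vs_byte_offset are hypothetical.

```python
import struct

# time_anchor (u64) + vec_obj_hash (33 bytes) + byte_offset (u64), no padding.
REFERENCE = struct.Struct("<Q33sQ")
VS_HEADER_SIZE = 128

def unpack_reference(buf: bytes, offset: int = 0):
    """Return (time_anchor, vec_obj_hash, byte_offset) for one reference."""
    return REFERENCE.unpack_from(buf, offset)

def vs_byte_offset(record_index: int, record_size: int) -> int:
    """Offset of a record inside a VS Object: 128-byte header + packed records."""
    return VS_HEADER_SIZE + record_index * record_size

assert REFERENCE.size == 49
ref = REFERENCE.pack(5, b"\x1e" + bytes(32), vs_byte_offset(2, 3080))
assert unpack_reference(ref) == (5, b"\x1e" + bytes(32), 6288)
```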

6.3.1 VS Object scope: content-addressed and shareable across Tracks

VS Objects are content-addressed Objects like everything else in the DreamDB content layer. Their address path includes the modality (<timeline>/<modality>/vectors/<hash>) for storage organization, but their identity is the content hash.

Cross-Track sharing within the same modality is permitted and encouraged. When two Tracks of the same modality share vector content (e.g., a video re-published with different metadata, or a derivative computed multiple times deterministically), their VS Objects collide on hash and the second PUT is a no-op. The Bucket Objects in each Track reference the same VS Object via its content hash. Storage cost is paid once per unique vector-byte-set per modality, not per Track.

Operationally:

  • A writer publishing a new Track with reference-mode embedding does NOT need to check whether VS Objects already exist; they PUT by hash and the backend deduplicates (idempotent put-by-hash, 0005 §3.2).
  • The Track Object's Bucket entries reference VS Object hashes. Multiple Track Objects across multiple Manifests can reference the same VS Object — DreamDB's GC walk (per 0006 §7.3) MUST treat reachability as transitive across this reference: a VS Object is reachable iff any reachable Bucket references it.

Cross-modality sharing is NOT permitted by path — different modality tags produce different paths, hence different VS Object addresses even if the bytes are identical. This is intentional: cross-modality vector reuse implies the modalities have compatible interpretation, which DreamDB doesn't reason about. Operators who genuinely have cross-modality byte-equality SHOULD merge the modalities into one rather than relying on hash deduplication across paths.

6.3.2 GC reachability for VS Objects

The reachability walk (0006 §7.3.1) for reference-mode tracks:

For each Track in reachable manifests:
  For each Bucket Object referenced by Track's object_index:
    Mark Bucket reachable.
    Decode Bucket's references.
    For each reference's vec_obj_hash:
      Mark VS Object reachable.

Without this transitive walk, GC could DELETE VS Objects still referenced by live Buckets, producing dangling references and silent data loss. The conformance suite (0009 §7) includes a GC-with-reference-mode test asserting transitive preservation.
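
The walk above can be sketched over in-memory stand-ins. Here track_buckets maps a Track to the Bucket hashes in its object_index, and bucket_refs maps a Bucket hash to the vec_obj_hash values it contains; both structures are hypothetical substitutes for decoding real Objects.

```python
from typing import Dict, List, Set, Tuple

def mark_reachable(track_buckets: Dict[str, List[bytes]],
                   bucket_refs: Dict[bytes, List[bytes]]
                   ) -> Tuple[Set[bytes], Set[bytes]]:
    """Return (reachable Bucket hashes, reachable VS Object hashes)."""
    buckets: Set[bytes] = set()
    vs_objects: Set[bytes] = set()
    for bucket_hashes in track_buckets.values():
        for bucket in bucket_hashes:
            if bucket in buckets:
                continue  # already visited via another Track
            buckets.add(bucket)
            # Transitive step: every VS Object a live Bucket references is live.
            vs_objects.update(bucket_refs.get(bucket, []))
    return buckets, vs_objects

tracks = {"t1": [b"b1"], "t2": [b"b2"]}
refs = {b"b1": [b"vs-shared"], b"b2": [b"vs-shared", b"vs-2"], b"b3": [b"vs-dead"]}
live_buckets, live_vs = mark_reachable(tracks, refs)
assert live_vs == {b"vs-shared", b"vs-2"}
```

A VS Object referenced only by an unreachable Bucket (vs-dead in the example) stays outside the reachable set and is eligible for collection.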

6.4 Bucket-splitting rule (resolves OQ-24)

The 100 MiB splitting rule applies to all bucketed Object kinds — Spatial Buckets (this section), Time-bucketed batch Objects (§8.2), and Fragments where applicable. Every conformant Bucket-Object emitter caps Object size at the configured threshold and splits when exceeded.

For Spatial Buckets specifically: when a Bucket Object would exceed 100 MiB of payload (encoded record bytes after header), the writer SPLITS:

  • Close the current bucket, hash, PUT.
  • Start a new Bucket Object with the same <spatial-key> but distinct content. New Object's address has a different content-hash.
  • Add a new entry to the Track Object's object_index with the same spatial_key and the new bucket's [t_start, t_end) time range — per 0002 §7.3.1's allowance for duplicate spatial_key_i.

100 MiB target chosen because:

  • At 3 KB/vector, 100 MiB = ~33,000 vectors per bucket. That is well above the "250–4000" recommended bucket population, so the cap is reached only by hot spatial keys (queries do exact KNN over the union of fetched buckets).
  • Comfortably below S3's single-PUT object size limit (5 GiB), so every Bucket is a single-shot PUT.
  • ~10 ms wire transfer at commodity 100 Gbps — bounded GET latency.

Operators MAY tune via modality parameter bucket-max-bytes=<N> (default 100 MiB; permitted range 1–500 MiB).
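
The writer-side split can be sketched as a greedy partition of encoded records. This is an illustration only: it counts payload bytes after the header, as the rule above specifies, and omits hashing, PUT, and the object_index update. The name split_records is hypothetical.

```python
from typing import List

MIB = 1024 * 1024

def split_records(records: List[bytes],
                  max_payload_bytes: int = 100 * MIB) -> List[List[bytes]]:
    """Group records into Bucket payloads that each stay within the cap."""
    buckets: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for rec in records:
        # Close the current bucket before it would exceed the threshold.
        if current and size + len(rec) > max_payload_bytes:
            buckets.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        buckets.append(current)
    return buckets

parts = split_records([b"x" * 4] * 5, max_payload_bytes=8)
assert [len(p) for p in parts] == [2, 2, 1]
```

Each resulting group would become its own Bucket Object under the same <spatial-key>, with its own content hash and its own object_index entry.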

Multi-table × bucket-splitting interaction. When a modality is multi-table (tables=L > 1, per 0004 §6.2), a hot region in the underlying data triggers splits independently in each of the L tables. Index growth scales as L × splits × hot_regions:

  • A Track with 1000 hot regions and L=4, after each hot region splits 5×: 1000 × 4 × 5 = 20,000 extra bucket-index entries from the hot regions alone, on top of the baseline single-bucket-per-region count.
  • For multi-table Tracks at high write rates, operators SHOULD consider larger bucket-max-bytes (e.g. 200–300 MiB) to reduce split frequency. Larger buckets trade slightly higher per-bucket fetch latency for fewer index entries and less write-time bookkeeping.
  • The lineage validation check (§6.1.1) still applies per-table; each table's Buckets carry their own spatial_index_hash. Splits within a table all share that table's SpatialIndex.

6.4.1 Reader behavior across splits (mandatory)

When a query resolves to multiple Bucket Objects sharing the same <spatial-key> (because of splits), the SDK MUST:

  1. Logically merge the Buckets' contents into the candidate set for the query. All Items across all matching Bucket Objects participate in the exact KNN comparison.
  2. Order Buckets via the Track Object's object_index, sorted by t_start, NOT via the backend's list-prefix output. The backend's list-prefix returns Bucket Objects in content-hash lex byte order — unrelated to time. Ordering by t_start (from the index) gives a deterministic, time-meaningful sequence for downstream uses (e.g., presenting results sorted chronologically; pruning by query time-range).
  3. Filter by query time-range (when present in the query) using the per-entry [t_start_i, t_end_i) from the index — pre-filter Bucket Objects whose extent doesn't overlap the query window before fetching, saving GETs.
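
The reader contract can be sketched as a selection over index entries. The Entry type and select_buckets function are hypothetical; the tie-break on address for equal t_start values is an assumption added here for determinism and is not stated in the text.

```python
from typing import List, NamedTuple, Optional, Tuple

class Entry(NamedTuple):
    spatial_key: str
    t_start: int
    t_end: int      # exclusive: the extent is [t_start, t_end)
    address: bytes

def select_buckets(index: List[Entry], spatial_key: str,
                   window: Optional[Tuple[int, int]] = None) -> List[Entry]:
    """Pick all entries for a spatial key, pre-filter by time, order by t_start."""
    hits = [e for e in index if e.spatial_key == spatial_key]
    if window is not None:
        q_start, q_end = window
        # Half-open interval overlap test; non-overlapping Buckets are never fetched.
        hits = [e for e in hits if e.t_start < q_end and q_start < e.t_end]
    return sorted(hits, key=lambda e: (e.t_start, e.address))

index = [Entry("0101", 200, 300, b"h2"), Entry("0101", 0, 100, b"h1"),
         Entry("1111", 0, 100, b"h3")]
assert [e.address for e in select_buckets(index, "0101")] == [b"h1", b"h2"]
assert [e.address for e in select_buckets(index, "0101", (100, 250))] == [b"h2"]
```

The records of every selected Bucket then join one candidate set for the exact KNN comparison.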

6.4.2 No sequence numbers in paths

Bucket-Object paths remain <timeline>/<modality>/<spatial-key>/<content-hash> with no sequence-number suffix. Adding a _001, _002, … suffix would:

  • Require writer coordination for monotonic sequence allocation — incompatible with 0000 §5.2's lock-free data plane.
  • Solve a problem that doesn't exist — splits produce different bytes (different items per split) → different BLAKE3 hashes → different paths. Naming collisions are impossible by construction.
  • Break content-addressing's "the path IS the hash" guarantee — sequence numbers add a non-content-derived component to the address.

Deterministic temporal ordering of splits is achieved via the Track Object's object_index (which records t_start_i per entry), not via path naming. The object_index IS the source of truth (Manifest Supremacy, 0005 §5.3.1); list-prefix is bootstrap-only and its hash-lex order doesn't matter for correctness.

6.6 Append-time fragmentation (LSM steady state, added 2026-05-22)

In addition to the size-based splits of §6.4, writers MAY emit multiple Bucket Objects per <spatial-key> as a steady-state property of the append path. Under the LSM retrofit (per design/0006 §B-LSM and spec/0021), append_many writes ONE new Bucket Object per touched cell with only the records from this batch; the Track Object's object_index accumulates multiple SpatialBucketEntry entries per spatial_key over time.

This is a structural property, not a transient migration:

  • Writes never read prior buckets. Each batch's write cost is O(new records) regardless of dataset size N. Per-cell-fragment count F grows monotonically until consolidated by compaction.
  • Reads MUST union all SpatialBucketEntry entries for the cells they probe. Behavior is identical to §6.4.1's reader contract for split-induced multiple-buckets-per-cell — the entries are joined, ordered by t_start, and contribute their records to the candidate set.
  • Compaction (per spec/0021) collapses F → 1 for cells where the operator chooses. Compaction is operator-driven (CLI / k8s CronJob), runs read-online, and is idempotent.

Why this is safe under content addressing. Each fragment is a content-addressed immutable Object; the Track Object's object_index is the only authoritative listing of which fragments belong to a cell (Manifest Supremacy, 0005 §5.3.1). Two fragments for the same <spatial-key> have distinct content hashes by construction and never collide on path.

Lineage constraint across fragments. All fragments for a given cell MUST share spatial_index_hash and vector_compressor_hash in their bucket header. Writers fail loudly when these differ across an append boundary (the operator must use a feature branch per 0008 §6). Compaction validates this constraint before consolidating (per 0021 §4) — fragments with divergent SI/VC cannot be merged.

No new structural format. The SpatialBucketEntry byte layout (§7.4 Table) is unchanged. Multi-bucket-per-cell already worked structurally via §6.4 splits; this section formalizes its append-time use.

6.5 Spatial+time segment order (resolves OQ-13)

For modalities that use both <spatial-key> and <time-bucket> segments (spatiotemporally-partitioned), the canonical segment order is:

<timeline>/<modality>/<spatial-key>/<time-bucket>/<hash>

Spatial-first. Rationale:

  • Feature queries dominate the query mix for spatially-bucketed tracks (that's why they're spatially bucketed). Spatial-first puts the most-selective dimension at the outermost prefix.
  • Time-only queries on a spatial-bucketed track are rare; when they happen, they list-prefix by <timeline>/<modality>/ and filter at the Track Object index level (not by list-prefix on the time-bucket segment). Manifest Supremacy applies regardless.
  • Storage-layout sharding on the outer prefix favors spatial cells, which is where contention is.

7. Index Page Format (resolves OQ-14, OQ-15, OQ-20, OQ-21)

Index Pages (per 0002 §7.2.2 / §7.3.2) are the B-tree pages that make 1M+ item tracks tractable.

7.1 Default fanout and page size (resolves OQ-14)

Parameter        | Default | Permitted range
-----------------+---------+-----------------
Fanout B         | 256     | 64–1024
Target page size | 16 KiB  | 8–128 KiB
Max page size    | 64 KiB  | (page-size cap)

Fanout = 256 chosen because:

  • log_256(10⁹) = 3.7 → tree height of 4 covers 1B+ items. Cold-start traversal = 4 GETs.
  • 256-entry pages × ~70 bytes/entry (inline form) ≈ 18 KiB — close to target page size.
  • Cache-friendly: fits in L2 of modern CPUs.
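
The height and page-size arithmetic can be checked with a short sketch. It uses integer multiplication rather than floating-point logarithms to avoid rounding at exact powers of the fanout; the function name tree_height is illustrative.

```python
def tree_height(items: int, fanout: int = 256) -> int:
    """Smallest height h such that fanout**h >= items."""
    height, capacity = 1, fanout
    while capacity < items:
        capacity *= fanout
        height += 1
    return height

assert tree_height(10**9) == 4           # 256**4 is about 4.29e9
assert tree_height(10**9, fanout=64) == 5
assert tree_height(10**9, fanout=1024) == 3
assert 256 * 70 == 17_920                # bytes per page at ~70 B/entry
```

At the default fanout, a cold-start lookup over a billion-item Track therefore touches four pages, and a full page at ~70 bytes per entry is about 17.5 KiB.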

7.2 Inline-vs-paged threshold (resolves OQ-15)

Per 0002 §7.2.1 / §7.3.1: writers MUST switch to paged form when inline object_index exceeds 1 MiB of CBOR. Implementations MAY switch sooner; SHOULD switch sooner if the Track is expected to grow to >100K items.

The 1 MiB threshold balances:

  • Cold-start latency for small Tracks: inline form means one Track Object fetch covers everything. < 1 MiB at HTTP/2 is ~10 ms.
  • Write amplification under live ingest: at 1 MiB inline, every append rewrites the full Track Object. Beyond 1 MiB this becomes meaningfully wasteful; paged form avoids it.

7.3 Delta encoding (resolves OQ-20)

For Fragment-track and Time-batch Index Page leaves, time anchors within a single page are delta-encoded relative to the page's t_min:

IndexPage {
   "type":      "leaf",
   "modality":  "video.h264",
   "t_min":     <u64 absolute>,
   "t_max":     <u64 absolute>,
   "entries":   [
      [<delta_start>, <duration>, <byte_size>, <address>],
      ...
   ],
}

Leaf entries are positional CBOR arrays, not maps. The field order is fixed for every leaf-track-kind, documented per Track kind below. Future spec revisions MAY append fields after the v0 positions per 0002 §3.1.1's array-length-as-version rule; readers MUST tolerate longer arrays by ignoring trailing fields.

Per Track Kind (positional field order):

Track kind | Leaf-entry array fields (positional)
Fragment-bearing (media) | [delta_start, duration, byte_size, address]
Spatial-Bucket | [spatial_key, delta_start, duration, byte_size, address, table_id?] (table_id present iff modality declares tables=L > 1)
Time-bucketed batch (events) | [delta_start, duration, time_bucket, address]
Unbucketed | [address] (single-element array; time anchor encoded into the address itself)
Constant | (no leaf entries; Track Object holds a single constant_address)

Where:

  • <delta_start> = t_start - t_min (CBOR uint).
  • <duration> = t_end - t_start (CBOR uint).
  • <byte_size> = full byte length (CBOR uint).
  • <address> = 33-byte multihash (CBOR byte string).
  • <spatial_key> = base2 string (per 0002 §6.3.2).
  • <table_id> = small CBOR uint identifying which SpatialIndex table.

Storage savings vs. map-keyed encoding:

Encoding | Bytes per Fragment-track entry | Bytes saved per entry
Full anchors + map | ~82 B | (baseline)
Deltas + map | ~70 B | -12 B
Deltas + array | ~43 B | -39 B (current spec)

For a 1B-fragment Track index, the per-entry figures above give ~39 GB of savings vs. the original full-anchors-and-map form (39 B × 10⁹) and ~27 GB vs. deltas-and-map (27 B × 10⁹); the deltas-and-array index itself totals ~43 GB. The schema-pinning cost is borne once in this spec; the storage win compounds across every Track at every scale.

The page header still carries t_min and t_max as full u64 absolute values (one per page) so the SDK can do range-overlap tests against the page without decoding entries.
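The Fragment-track leaf encoding described in this section can be sketched as follows. This is an illustration under stated assumptions, not a reference encoder: pages are modelled as plain Python dicts and lists rather than deterministic CBOR, the modality field is omitted, and the function names are hypothetical.

```python
from typing import List, Tuple

# (t_start, t_end, byte_size, address) with absolute time anchors.
Fragment = Tuple[int, int, int, bytes]


def encode_leaf(fragments: List[Fragment]) -> dict:
    """Build a leaf page whose entries are positional, delta-encoded arrays."""
    if not fragments:
        # ASSUMPTION: this sketch refuses to build a page with no entries.
        raise ValueError("a leaf page needs at least one entry")
    t_min = min(f[0] for f in fragments)
    t_max = max(f[1] for f in fragments)
    entries = [[t_start - t_min, t_end - t_start, size, addr]
               for (t_start, t_end, size, addr) in fragments]
    return {"type": "leaf", "t_min": t_min, "t_max": t_max, "entries": entries}


def decode_leaf(page: dict) -> List[Fragment]:
    """Recover absolute anchors; trailing fields from later revisions are ignored."""
    t_min = page["t_min"]
    out = []
    for entry in page["entries"]:
        delta_start, duration, size, addr = entry[:4]
        t_start = t_min + delta_start
        out.append((t_start, t_start + duration, size, addr))
    return out
```

Slicing each entry to its first four positions is one way to satisfy the rule that readers tolerate longer arrays.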

7.4 Byte-size delta encoding (resolves OQ-21)

For Fragment tracks, byte_size per Fragment varies but the cumulative byte offsets within a single DreamDB bucket-time-segment are sequentially ordered. Index Pages MAY (not MUST) carry the byte_size field as-is or as a delta from a per-page baseline; v0 specifies as-is (per §7.3 above) — the savings from byte-size deltas are minor relative to time-anchor deltas, and the encoding complexity isn't justified at v0.

A future spec MAY introduce byte_size_delta if benchmarking shows meaningful additional savings.

7.5 Internal page entries

Internal (non-leaf) Index Pages:

IndexPage {
   "type":      "internal",
   "modality":  "video.h264",
   "t_min":     <u64>,
   "t_max":     <u64>,
   "entries":   [
      [<child_t_min>, <child_t_max>, <child_address>, <child_item_count>],
      ...
   ],
}

Internal entries are also positional CBOR arrays, with field order [child_t_min, child_t_max, child_address, child_item_count]. Internal entries carry full u64 anchors (not deltas) — internal pages are far less numerous than leaf pages, so delta savings are negligible relative to leaf-level savings, and full anchors keep child-page selection arithmetic simple.
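With full anchors on internal entries, child-page selection reduces to a range-overlap test, as in this sketch. It assumes closed intervals [child_t_min, child_t_max], which this section does not state explicitly; the function name is hypothetical.

```python
from typing import List, Sequence


def select_children(entries: Sequence[Sequence], q_start: int, q_end: int) -> List:
    """Return addresses of children whose time range overlaps [q_start, q_end]."""
    hits = []
    for entry in entries:
        # Positional fields; anything after the first four is ignored.
        child_t_min, child_t_max, child_address, _child_item_count = entry[:4]
        # ASSUMPTION: both ranges are closed intervals.
        if child_t_min <= q_end and q_start <= child_t_max:
            hits.append(child_address)
    return hits
```

No subtraction or base lookup is needed before the comparison, which is the simplicity the full-anchor choice buys.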

7.6 Dynamic tree height (no hard ceiling)

Tree height is not capped by the spec. As a Track grows, the B-tree grows upward via standard root-creation:

  • When a write would cause the current root Index Page to exceed the page-size target (§7.1):
    1. The writer creates a new internal Index Page.
    2. The old root becomes one child of the new internal page.
    3. The new entry is added as a sibling child (after a leaf-level split ripples up to the root).
    4. The new internal page becomes the root.
    5. The Track Object's tree_height field is incremented by 1.
  • The Track Object always points at the current root via its object_index.root field; readers descend from there, using tree_height to know how many levels lie between the root and the leaves.
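The root-creation steps can be sketched in a few lines. Several things here are stand-ins, not spec behavior: pages are hashed as sorted JSON with SHA-256 instead of deterministic CBOR with a multihash, the store is an in-memory dict, and the leaf split that produces the sibling entry is taken as already done. All names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


def put_page(store: dict, page: dict) -> str:
    """Store an immutable page under its content hash (stand-in encoding)."""
    blob = json.dumps(page, sort_keys=True).encode()
    address = hashlib.sha256(blob).hexdigest()
    store[address] = page
    return address


@dataclass(frozen=True)
class TrackObject:
    root: str
    tree_height: int


def grow_root(store: dict, track: TrackObject,
              old_root_entry: list, sibling_entry: list) -> TrackObject:
    """Steps 1-5: new internal page above the old root and its new sibling.

    Each entry is [child_t_min, child_t_max, child_address, child_item_count].
    """
    new_root = {
        "type": "internal",
        "t_min": min(old_root_entry[0], sibling_entry[0]),
        "t_max": max(old_root_entry[1], sibling_entry[1]),
        "entries": [old_root_entry, sibling_entry],
    }
    # Copy-on-write: a new Track Object is returned; the old one is untouched.
    return TrackObject(root=put_page(store, new_root),
                       tree_height=track.tree_height + 1)
```

Returning a fresh TrackObject rather than mutating the old one mirrors the content-addressed model: the previous root and the Track Object that pointed at it stay valid.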

Capacity at successive heights with default fanout B = 256:

Height | Max items per Track | Practical workload
1 | 256 | tiny test data
2 | 65 K | small tracks
3 | 17 M | medium tracks
4 | 4.3 B | typical billion-scale workloads
5 | 1.1 T | high-frequency sensor data over years
6 | 280 T | extreme; effectively unbounded for any practical workload
7 | 72 P | far exceeds any realistic Track
8 | 18 E (1.8 × 10¹⁹) | beyond i64 ranges entirely

v0 implementations SHOULD support up to height = 8 (covering ~10¹⁹ items — beyond any practical concern). Implementations MAY cap lower for memory bounds; if they do, exceeding the cap MUST surface to the writer as a clear error rather than silently overflowing or losing data.
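The capacity figures and the cap behavior can be illustrated with a small sketch. The exception type and function names are hypothetical; the default cap of 8 follows the SHOULD above.

```python
class TreeHeightExceeded(Exception):
    """Raised to the writer instead of silently overflowing."""


def max_items(height: int, fanout: int = 256) -> int:
    """Maximum leaf entries reachable at a given tree height."""
    return fanout ** height


def check_height(new_height: int, cap: int = 8) -> int:
    """Return new_height, or raise a clear error if it exceeds the cap."""
    if new_height > cap:
        raise TreeHeightExceeded(
            f"tree height {new_height} exceeds implementation cap {cap}")
    return new_height


for h in range(1, 9):
    print(h, max_items(h))
```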

The previous root remains addressable (immutable, content-addressed). It just no longer points at the current top of the tree — the Track Object's object_index.root does. Older Manifest versions referencing the old root are unaffected by the height growth, since they reference a different (older) Track Object that points at the old root.

This is the same root-growth pattern used by append-only and copy-on-write B-trees (CouchDB's storage engine, LMDB-style designs). It is fully compatible with DreamDB's content-addressing and copy-on-write discipline.

8. Time-Bucketed Batch Format (resolves OQ-8)

Time-bucketed batch Objects (event Tracks) hold sparse (time_anchor, payload) records sorted by time anchor.

Empty-batch rule: a Time-batch Object MUST have item_count ≥ 1. Empty Time-batches (header-only, no items) are malformed; writers MUST NOT emit them; readers MUST reject them as backend corruption. A Track with zero events at all is encoded by an empty Track-Object object_index (per 0001 §4.5) — no Time-batch Objects at all, rather than an empty Time-batch Object.

8.1 Layout

The index is placed immediately after the header, before the item payload bytes. This keeps header + index fetchable in a single ranged GET (e.g. Range: bytes=0-<index_end>), enabling event lookups in two backend round trips total (one for header+index, one for the matching event).

┌──────────────────────────────────────────────────────────┐
│ Batch Header (64 bytes, fixed)                           │
│   magic:         4 bytes = 0x56424154 ("VBAT")           │
│   version:       u32     = 1                             │
│   bucket_t_min:  u64     (nominal bucket start; ns)      │
│   bucket_t_max:  u64     (nominal bucket end; ns)        │
│   item_count:    u32                                     │
│   index_size:    u32     (= item_count × 16 bytes)       │
│   reserved:      remaining bytes zero                    │
├──────────────────────────────────────────────────────────┤
│ In-Object Index (item_count × 16 bytes)                  │
│   Entry 0                                                │
│     time_anchor: u64                                     │
│     byte_offset: u32   (relative to start of Object)     │
│     byte_size:   u32                                     │
│   Entry 1                                                │
│   ...                                                    │
├──────────────────────────────────────────────────────────┤
│ Item bytes (variable size; located at byte_offset values │
│             from the index, all ≥ 64 + index_size)       │
│   item 0 payload                                         │
│   item 1 payload                                         │
│   ...                                                    │
└──────────────────────────────────────────────────────────┘

The in-Object index makes byte ranges computable in two backend round trips:

  1. GET <object> with Range: bytes=0-<64 + index_size - 1> (HTTP ranges are inclusive) → fetches the header and the full index in one read (typically a few KB).
  2. SDK reads the index, locates the target event by time_anchor, then either emits a bytes:<byte_offset>-<byte_offset + byte_size> URI (per 0002 §6.5) or issues a follow-up GET for the event payload.

For batch queries that select N events, step 2 fans out to N parallel HTTP/2 GETs against the same Object — still amortizes well within a single connection.
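Reader-side handling of the header and index fetched in step 1 might look like the sketch below. One assumption is load-bearing: the layout above does not state a byte order, so big-endian is assumed here purely for illustration. The function and type names are hypothetical.

```python
import bisect
import struct
from typing import List, NamedTuple, Optional

HEADER_SIZE = 64
ENTRY_SIZE = 16
# ASSUMPTION: big-endian integers; the layout does not specify a byte order.
HEADER_FMT = ">4sIQQII"   # magic, version, bucket_t_min, bucket_t_max, item_count, index_size
ENTRY_FMT = ">QII"        # time_anchor, byte_offset, byte_size


class IndexEntry(NamedTuple):
    time_anchor: int
    byte_offset: int
    byte_size: int


def parse_prefix(buf: bytes) -> List[IndexEntry]:
    """Parse the header and in-Object index from the start of a Time-batch."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("short header")
    magic, _version, _t_min, _t_max, item_count, index_size = \
        struct.unpack_from(HEADER_FMT, buf, 0)
    if magic != b"VBAT":
        raise ValueError("bad magic")
    if item_count < 1:
        raise ValueError("empty Time-batch is malformed")
    if index_size != item_count * ENTRY_SIZE:
        raise ValueError("index_size does not match item_count")
    if len(buf) < HEADER_SIZE + index_size:
        raise ValueError("buffer does not cover the index")
    entries = [IndexEntry(*struct.unpack_from(ENTRY_FMT, buf, HEADER_SIZE + i * ENTRY_SIZE))
               for i in range(item_count)]
    for e in entries:
        # Offsets are absolute, so none may point into the header or index.
        if e.byte_offset < HEADER_SIZE + index_size:
            raise ValueError("byte_offset points before the item region")
    return entries


def find_event(entries: List[IndexEntry], time_anchor: int) -> Optional[IndexEntry]:
    """Binary-search the index (sorted by time anchor) for an exact match."""
    anchors = [e.time_anchor for e in entries]
    i = bisect.bisect_left(anchors, time_anchor)
    if i < len(entries) and anchors[i] == time_anchor:
        return entries[i]
    return None
```

The entry returned by find_event carries everything needed for step 2: byte_offset and byte_size give the span of the follow-up ranged GET.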

Why index-at-start (not footer): the alternative footer layout requires either an HTTP suffix-range request (Range: bytes=-N, supported by all major backends but not universally), or a HEAD-then-GET to discover Object length first, adding a third round trip. Header-prefixed index is universal across backends and avoids the round trip. This is the "in-Object offset table" pattern from 0002 §6.5.3 in its canonical placement.

8.1.1 byte_offset semantics — measured from start of Object

Each in-Object index entry's byte_offset value is the absolute offset from the start of the Object (byte 0), NOT a relative offset within the Item bytes section. This makes byte-range URI emission direct: an event's bytes:<start>-<end> URI is computed as bytes:<byte_offset>-<byte_offset + byte_size>, with no further arithmetic.

The structural consequence: every byte_offset value is ≥ 64 + index_size (the minimum is just past the index region — the first item payload). Implementations that follow an earlier draft's "offset within Item bytes section" interpretation (relative to the start of the Item bytes section, zero-based at the Item-bytes-region origin) are non-conformant — their emitted addresses will be off by (64 + index_size) bytes, silently producing wrong byte ranges.

Worked example. A Time-batch with item_count = 3, index_size = 48 (3 × 16):

Object layout:
  bytes [   0,   64) → Header
  bytes [  64,  112) → In-Object Index (3 entries × 16 bytes)
  bytes [ 112,  ...) → Item bytes

Index entries (after header):
  Entry 0:  time_anchor = 152_481_000_000
            byte_offset = 112        (first item starts immediately after index)
            byte_size   = 200
  Entry 1:  time_anchor = 152_500_000_000
            byte_offset = 312        (= 112 + 200)
            byte_size   = 150
  Entry 2:  time_anchor = 152_600_000_000
            byte_offset = 462        (= 312 + 150)
            byte_size   = 250

URI for the second event:
  dreamdb:///<...>#bytes:312-462
  
  → HTTP Range: bytes=312-461   (Connector applies half-open → inclusive)

The SDK never adds 64 + index_size to a stored byte_offset — that adjustment is implicit in the value as written. Conformance test vectors (0009 §5.4) include a Time-batch round-trip that asserts emitted URIs match expected absolute offsets.
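The worked example can be reproduced with a short sketch. It assumes items are packed back to back with no padding, as in the example; the function names are hypothetical.

```python
from typing import List


def layout_offsets(item_sizes: List[int], header_size: int = 64,
                   entry_size: int = 16) -> List[int]:
    """Absolute byte_offset of each item, measured from byte 0 of the Object."""
    # ASSUMPTION: items are contiguous, starting right after the index.
    offset = header_size + len(item_sizes) * entry_size
    offsets = []
    for size in item_sizes:
        offsets.append(offset)
        offset += size
    return offsets


def bytes_fragment(byte_offset: int, byte_size: int) -> str:
    """Half-open bytes:<start>-<end> fragment, as emitted in URIs."""
    return f"bytes:{byte_offset}-{byte_offset + byte_size}"


def http_range(byte_offset: int, byte_size: int) -> str:
    """Inclusive HTTP Range header value covering the same span."""
    return f"bytes={byte_offset}-{byte_offset + byte_size - 1}"


offsets = layout_offsets([200, 150, 250])
print(offsets)                          # [112, 312, 462]
print(bytes_fragment(offsets[1], 150))  # bytes:312-462
print(http_range(offsets[1], 150))      # bytes=312-461
```

The stored offsets already include the 64-byte header and the 48-byte index, so neither helper adds them again.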

8.1.2 Splitting and item-count guidance for Time-batches

The 100 MiB splitting rule from §6.4 applies to Time-batch Objects: when a Time-batch would exceed bucket-max-bytes (default 100 MiB; configurable per modality), the writer SPLITS — emits a new Time-batch Object with the same <time-bucket> segment but a different content-hash; adds a new entry to the Track Object's object_index.

Beyond raw-bytes splitting, Time-batches have an additional sizing concern from index size. The header+index region is fetched in step 1 of every event lookup (§8.1); large item_count makes that first GET expensive:

  • item_count = 1K → index ~16 KB (one fast ranged GET).
  • item_count = 100K → index ~1.6 MB (still single ranged GET; ~30 ms wire transfer).
  • item_count = 1M → index ~16 MB (slow first lookup; defeats the header+index locality).

Writers SHOULD target ~100K events per Time-batch Object for high-volume event modalities — this lands the index at ~1.6 MB while leaving room for typical event payloads (50–500 bytes each) under the 100 MiB cap. Operators tune bucket=<duration> to match expected event rate to this target:

Event rate | Recommended bucket= for ~100K events/batch
10 events/s | bucket=3h (~108K events)
100 events/s | bucket=15m (~90K events)
1 K events/s | bucket=2m (~120K events)
10 K events/s | bucket=10s (~100K events)
100 K events/s | bucket=1s (~100K events)

The exact cap is operator-tunable; the table is guidance for the typical sweet spot. A Time-batch whose total size (header + index + item payloads) would exceed bucket-max-bytes (default 100 MiB) MUST split (size-driven); operators MAY split sooner (count-driven) by lowering bucket=<duration> to keep first-lookup latency bounded.
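The sizing guidance reduces to simple arithmetic, sketched here. The function names are hypothetical, and the computed durations are exact quotients; the table above rounds them to convenient values such as 3h or 15m.

```python
ENTRY_SIZE = 16  # bytes per in-Object index entry


def events_per_batch(events_per_second: int, bucket_seconds: int) -> int:
    """Expected item_count for one bucket at a steady event rate."""
    return events_per_second * bucket_seconds


def index_bytes(item_count: int) -> int:
    """Size of the in-Object index fetched on the first lookup."""
    return item_count * ENTRY_SIZE


def bucket_seconds_for(events_per_second: int, target_events: int = 100_000) -> int:
    """Bucket duration (whole seconds, at least 1) that lands near the target."""
    return max(1, round(target_events / events_per_second))


print(events_per_batch(10, 3 * 3600))   # 108000
print(index_bytes(100_000))             # 1600000
print(bucket_seconds_for(10_000))       # 10
```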

8.2 Concurrent constant correction conflicts (resolves OQ-7)

Two writers concurrently publishing higher-layer Constants of the same modality (e.g., two corrections to title.text) produce two Layer Tracks, both referenced by the new Manifest with role: layer-of:<base-title-track>. The Manifest's tracks list contains both.

Resolution rule for readers: when the Manifest contains multiple layer-of Tracks for the same (timeline, modality) Constant, the lexicographically-greatest layer Track address wins. Deterministic, content-addressable, no central tiebreaker.

For non-Constant tracks (Continuous Signal, Discrete Event), multiple layer-of Tracks for the same (timeline, modality) are interpreted as union — readers see Items from all layers. Conflicts at the same (time_anchor, content_hash) collapse naturally (same hash = same Item).

This rule favors deterministic reproducibility over "most recent" — ts fields aren't trusted for resolution because writer clocks aren't trusted (per 0003 §11). A writer that wants "last-writer-wins" semantics manually orders its Publish to be after the other writer's, and the layer-Track address comparison falls in its favor by happening to be lexicographically later.
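Both reader-side rules can be sketched briefly. The sketch assumes layer Track addresses are compared as raw byte strings, which this section does not pin down, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def resolve_constant(layer_addresses: Iterable[bytes]) -> bytes:
    """Constants: the lexicographically-greatest layer Track address wins."""
    addresses = list(layer_addresses)
    if not addresses:
        raise ValueError("no layer Tracks to resolve")
    # ASSUMPTION: comparison is over raw address bytes; Python compares
    # bytes objects lexicographically.
    return max(addresses)


def union_items(layers: Iterable[Iterable[Tuple[int, bytes]]]) -> List[Tuple[int, bytes]]:
    """Non-Constants: union of all layers; identical (time_anchor, content_hash) collapse."""
    merged = set()
    for layer in layers:
        merged.update(layer)
    return sorted(merged)
```

Neither function consults timestamps, so two readers given the same Manifest reach the same result regardless of writer clocks.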

9. Multipart Upload (deferred per OQ-31)

The bucket-splitting rule (§6.4) caps Bucket Objects at 100 MiB; Fragment Objects are typically under 50 MiB. Multipart upload is not required for v0: every Object fits in a single PUT, far below S3's 5 GiB single-PUT limit. A future spec MAY relax bucket-max-bytes; that would be the time to define multipart.

10. Out of Scope for this Document

  • Vector encoding standards beyond what the modality declares (embedding.f32.dim=N). The application produces the f32 values; DreamDB stores them.
  • Codec choice beyond CMAF / fragmented MP4 for media. Specific codecs (H.264 vs AV1 vs HEVC) are modality parameters; the container format is the spec's concern.
  • Encryption-at-rest formats. SSE is a backend-level concern, transparent to DreamDB.
  • Compression of CBOR Objects. Manifests, Track Objects, Index Pages are deterministic CBOR — gzip would defeat hashing if applied inconsistently. Backends MAY enable transport-level compression (HTTP gzip/brotli on the wire); content-layer compression is forbidden.

11. Open Questions Surfaced by This Document

  • OQ-36 (→ 0009 §5.4): Conformance test vectors for byte-format round-trip. Resolved: full battery in 0009 §5.4 covering CMAF Fragment timing → byte-range computation; Spatial Bucket inline / reference modes; Time-batch in-Object index; Index Page delta encoding; lineage validation; dynamic B-tree height growth; multi-bucket merge across splits.
  • OQ-37 (→ v0.1): Media modalities beyond video.h264 / audio.opus. AV1 / HEVC / FLAC / VP9 — all use fragmented MP4, but per-codec parameter declarations need spec text.
  • OQ-38 (→ 0010): PQ-compressed vector storage layout. Resolved: 0010 §8 generalizes the §6.1 inline-vector format to arbitrary record_size via the vector_compressor Manifest registry field. The 160-byte header extends to 200 bytes when compress= is declared, adding a vector_compressor_hash lineage field (per 0010 §8.2 / §6.1.2 above).

Next: 0008-versioning-collab.md — Manifest history walks, branching, merging, multi-writer reconciliation, and the conflict-resolution semantics referenced from §8.2 of this document.