DreamDB Specification — 0002: Content Addressing & Address Grammar
Status: Draft. Builds on 0000-overview.md and 0001-data-model.md. This document fixes: the hash function, the canonical encoding for hashable Objects, the full address grammar, the modality-tag grammar (resolving 0000 OQ-6), and per-entity address derivation. Time-key encoding details are deferred to 0003; spatial-key derivation is deferred to 0004.
1. Purpose
0001 defined what the DreamDB entities are. This document defines what their bytes look like and what their addresses look like — the two prerequisites for content-addressing.
By the end of this document, the following are concrete:
- The hash function used everywhere in the spec.
- The canonical byte encoding of every hashable Object (Genesis, Track, Manifest, Bucket/Fragment, etc.).
- The string form of an address — usable both as a URI and as a backend object key.
- The modality-tag grammar.
- How each entity's address is derived from its bytes and its context.
Things still deferred:
- The time-key encoding (0003): how a timestamp becomes the time-keyed segment of an address.
- The spatial-key derivation algorithm (0004): how a vector becomes a spatial-key bit-string. This document fixes the slot for the spatial key and its encoding, but not how the bits are computed.
- The per-modality Object byte format (0007): what's inside a Fragment, a Spatial Bucket, or a Time-bucketed batch.
2. Hash Function: BLAKE3-256
DreamDB uses BLAKE3 (output truncated to 256 bits) as the cryptographic hash everywhere a content hash is needed.
Rationale:
- Throughput. ~6.8 GB/s single-threaded on modern x86; trivially parallelizable. The 1B-vector ingest path is hash-bound at scale; SHA-256 (~1 GB/s without SHA-NI) becomes the bottleneck.
- Tree-hashed structure. BLAKE3 is internally a Merkle tree over 1 KiB chunks. This means a Fragment or Bucket's content hash carries free subrange-verification capability: an SDK can verify that a fetched byte range matches the advertised hash without rehashing the whole Object. Aligns directly with DreamDB's Fragment-by-byte-range and Bucket-by-array-index access patterns.
- 256-bit output. Finding a collision by birthday attack takes on the order of 2¹²⁸ hash evaluations, which is adequate for the foreseeable future.
- Single-spec. No keyed/unkeyed, MAC, or KDF mode confusion at the protocol level. DreamDB always uses unkeyed BLAKE3 for content addressing.
A hash value is therefore always 32 bytes. Wherever this document writes <hash> without qualification, it means "32 bytes of BLAKE3-256 output."
2.1 Multihash prefix
To preserve future hash-function evolution, every hash value that appears as part of a string-form address carries a one-byte algorithm tag:
| Algorithm tag (hex) | Meaning |
|---|---|
| 0x1e | BLAKE3-256 (aligned with IPFS multihash table per 0009 §3.1) |
| 0x12 | SHA-256 (reserved; not used in v0) |
| Other | Reserved |
The tag is prepended to the hash bytes before string encoding (§8). Concretely, every hash in an address is 33 bytes on the wire: 0x1e || <32-byte-hash>.
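As a rough illustration of the 33-byte form, the sketch below prepends the algorithm tag to a 32-byte digest and renders the result as text. The helper names are hypothetical, and the string encoding shown (unpadded, lowercase RFC 4648 base32) is an assumption made only to show the resulting length; the normative encoding is the one defined in §8. The digest is passed in as raw bytes because BLAKE3 is not part of the Python standard library.

```python
import base64

BLAKE3_256_TAG = 0x1E  # algorithm tag from the table above
HASH_LEN = 32          # BLAKE3-256 output size in bytes


def wrap_multihash(digest: bytes) -> bytes:
    """Prefix a 32-byte digest with the one-byte algorithm tag (33 bytes total)."""
    if len(digest) != HASH_LEN:
        raise ValueError(f"expected a {HASH_LEN}-byte digest, got {len(digest)}")
    return bytes([BLAKE3_256_TAG]) + digest


def to_base32_text(multihash: bytes) -> str:
    """ASSUMPTION: unpadded, lowercase RFC 4648 base32; the real rule lives in §8."""
    return base64.b32encode(multihash).decode("ascii").rstrip("=").lower()
```

Under these assumptions, a 33-byte tagged hash renders as 53 characters, since 264 bits do not divide evenly into 5-bit groups and the final character carries the remainder.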
3. Canonical Encoding: Deterministic CBOR
Every hashable Object in DreamDB (Genesis, Track, Manifest, plus internal sub-structures) is serialized using deterministic CBOR (RFC 8949 §4.2 — "Core Deterministic Encoding Requirements"). The hash is computed over the deterministic-CBOR bytes.
Why CBOR-deterministic:
- Determinism. Two writers given the same logical Object produce byte-identical bytes (and therefore the same hash). This is what makes content-addressing work across implementations.
- Compactness. Smaller than JSON; comparable to MessagePack but with an actual deterministic-encoding standard.
- Tagged types. First-class support for byte strings, integers, maps, and a tag system — no string-escaping gymnastics for binary data like 32-byte hashes or 16-byte nonces.
- Mature deployments. Used by IPLD-DAG-CBOR, COSE, and CWT.
3.1 Subset rules
To eliminate ambiguity, DreamDB restricts deterministic CBOR slightly beyond RFC 8949:
- Maps vs. positional arrays: schema-typed DreamDB Objects use one of two encodings, chosen per-schema:
  - Maps with string keys (default for low-volume schemas): self-describing, debugger-friendly. Used for top-level Manifest, Genesis, SpatialIndex Object, and Track Object metadata. Integer-keyed maps are FORBIDDEN.
  - Positional CBOR arrays (mandatory for high-volume schemas: Index Page leaf entries and internal entries per 0007 §7.3 / §7.5): no field names; field order pinned in spec text. Saves ~40 bytes per entry; ~40 GB on a 1B-entry Track index.
3.1.1 Positional array forward-compatibility (array-length-as-version)
Positional arrays in DreamDB use the array length as an implicit schema version. Future spec revisions (v0.1+) MAY append new fields after the spec'd field count without breaking v0 readers, provided readers obey the discipline:
- Writers MUST emit at minimum the v0 field count for each schema. Writers MAY append additional fields (per a future spec revision) after the spec'd ones.
- Readers MUST iterate the array up to their known field count and ignore any trailing fields. They MUST NOT reject an array that has more fields than they expect — this is the forward-compatibility hatch.
- Readers MUST reject an array shorter than their known field count — this is a malformed write.
This means: v0.1 can add a byte_size_delta field to Fragment-track leaf entries by appending to the array. v0 readers see the array, read fields 0–3 (the original four), ignore field 4 onward. v0.1 readers read all five. No breaking change.
The v0 spec's positional field counts (0007 §7.3 / §7.5) are the minimum for each schema. Future fields go at the end; renumbering the existing positions is a breaking change forbidden within v0.x.
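The reader discipline above can be sketched in a few lines. This is a minimal illustration rather than SDK code; the function name and the constant are hypothetical, and the entry is assumed to be an already-decoded CBOR array represented as a Python list.

```python
FRAGMENT_LEAF_V0_FIELDS = 4  # hypothetical constant: v0 field count for one schema


def read_positional(entry: list, known_fields: int) -> list:
    """Return the first `known_fields` values, tolerating trailing fields."""
    if len(entry) < known_fields:
        # Shorter than the reader's known field count: a malformed write.
        raise ValueError(
            f"entry has {len(entry)} fields, reader requires at least {known_fields}"
        )
    # Longer arrays are accepted; trailing fields are ignored, never rejected.
    return list(entry[:known_fields])
```

A v0 reader calling `read_positional(entry, FRAGMENT_LEAF_V0_FIELDS)` on a five-field v0.1 entry gets the original four values back, while a three-field entry raises.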
3.1.2 No-re-emit invariant
The forward-compat hatch above creates a subtle hash hazard if SDKs re-encode Objects they didn't author:
- A v0 SDK fetches a v0.1 Index Page with 5-field leaf entries.
- The fetched bytes' BLAKE3 matches the address (correct — SDK hashes all bytes received).
- The SDK decodes entries into its 4-field internal representation, dropping the fifth field (field 4).
- If the SDK then re-encodes that page (in a derived computation, GC inventory, replicated write, etc.), the re-emitted bytes are 4-field — different bytes, different hash, broken address chain.
To prevent this class of silent corruption:
DreamDB SDKs MUST NOT re-encode Objects they did not originally author. Cached Object bytes are kept verbatim from the backend. Any logical "re-derivation" that produces new bytes also produces a new content hash and is therefore a new Object, addressed at its new hash, never written under the old hash's path.
Operationally:
- Caches store fetched bytes verbatim (or in a representation that allows byte-identical re-emission); they MAY also keep a decoded representation alongside but MUST NOT prefer it for any output that goes back on the wire.
- GC walks identify Objects by their backend addresses; they don't decode-and-re-encode for inventory.
- Federation / mirroring copies bytes verbatim across backends; never re-encodes.
- Merge operations (per 0008 §5) construct new Manifests with their own hashes; they don't modify existing Manifests.
This applies to all CBOR-encoded DreamDB Objects, not just Index Pages — Manifests, Track Objects, Genesis, SpatialIndex Objects all share the rule. The positional-array forward-compat hatch makes the hazard most visible, but the rule is universal.
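One way to picture the invariant is a cache that keeps the fetched bytes as the only source for anything re-emitted, with the decoded view held purely as a convenience. The sketch below is hypothetical throughout (class and method names are invented for illustration). Because BLAKE3 is not in the Python standard library, `content_hash` uses BLAKE2b-256 as a stand-in so the example runs; a real implementation would use BLAKE3-256.

```python
import hashlib


def content_hash(data: bytes) -> bytes:
    # STAND-IN ONLY: BLAKE2b-256 replaces BLAKE3-256 so this sketch is runnable.
    return hashlib.blake2b(data, digest_size=32).digest()


class VerbatimCache:
    """Keeps fetched Object bytes verbatim; decoded views are advisory only."""

    def __init__(self) -> None:
        self._raw = {}
        self._decoded = {}

    def put(self, expected_hash: bytes, raw: bytes, decoded=None) -> None:
        # Verify over all received bytes, including fields this reader ignores.
        if content_hash(raw) != expected_hash:
            raise ValueError("fetched bytes do not match the advertised hash")
        self._raw[expected_hash] = bytes(raw)
        if decoded is not None:
            self._decoded[expected_hash] = decoded

    def wire_bytes(self, object_hash: bytes) -> bytes:
        """Bytes for mirroring or re-serving: always the verbatim copy."""
        return self._raw[object_hash]

    def decoded(self, object_hash: bytes):
        """Local lookups only; never re-encode this for output."""
        return self._decoded.get(object_hash)
```

In this sketch a five-field array cached alongside a four-field decoded view still leaves the cache with exactly the bytes that hash to the original address.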
3.1.3 Map forward-compatibility (unknown-key tolerance)
DreamDB schema-typed maps (per §3.1's "Maps with string keys" pattern) follow a parallel forward-compatibility rule:
- Writers MUST emit all spec-required keys for the schema version they target. Writers MAY emit additional, schema-defined OPTIONAL keys, and MAY emit keys defined by a future spec revision.
- Readers MUST ignore any unknown string key without rejecting the map. An unknown key is the forward-compatibility hatch for adding new optional fields (e.g., space_config, hot_shard, vector_compressor were added by later specs without breaking v0 readers).
- Readers MUST reject a map missing a spec-required key: this is a malformed write.
This is the map-level analog of the positional-array rule in §3.1.1. The two rules together cover both encoding shapes; new spec versions extend both array tails and map key-sets safely.
The no-re-emit invariant of §3.1.2 applies identically: a reader observing a map with unknown keys MUST NOT re-encode the map without those keys — that would change the hash. Caches keep bytes verbatim.
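A minimal sketch of the map-reading discipline follows, assuming the map has already been decoded into a Python dict. The function and the key names used in the usage note are hypothetical; the returned view is for local use only and is never a basis for re-encoding.

```python
def read_map(obj: dict, required: set, known: set) -> dict:
    """Validate required keys and return only the keys this reader understands."""
    missing = required - set(obj)
    if missing:
        # A spec-required key is absent: a malformed write.
        raise ValueError(f"missing required keys: {sorted(missing)}")
    # Unknown keys are skipped silently; they are not an error.
    return {key: value for key, value in obj.items() if key in known}
```

For example, with `required={"a"}` and `known={"a", "b"}`, the input `{"a": 1, "future_key": 2}` yields `{"a": 1}`, and `{"b": 1}` is rejected.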
- Tags — DreamDB defines its own tag range (see §3.2). External CBOR tags are not interpreted by the protocol.
- Floats — protocol Objects MUST NOT contain floating-point numbers in fields that affect hashing. (Float canonicalization is a known pitfall.) Vector payloads inside Bucket Objects are not part of this restriction — they live opaquely inside payload byte strings, not as CBOR floats.
- Indefinite-length items — forbidden everywhere (deterministic CBOR already forbids them; this is a reminder).
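To make the subset concrete, here is a small encoder sketch covering only the shapes these rules allow: integers, byte strings, text strings, definite-length arrays, and string-keyed maps. It follows the RFC 8949 §4.2 rules of shortest-form heads and map keys sorted bytewise by their encoded form. The function names are hypothetical, tags are omitted, and this is an illustration rather than a conformant library.

```python
def _head(major: int, n: int) -> bytes:
    """Encode a CBOR head using the shortest argument form."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, width in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * width):
            return bytes([(major << 5) | info]) + n.to_bytes(width, "big")
    raise ValueError("integer out of CBOR range")


def encode_canonical(value) -> bytes:
    """Deterministically encode the restricted value subset described above."""
    if isinstance(value, (bool, float)) or value is None:
        # Floats are forbidden in hashed fields; other simple values are
        # left out to keep the sketch small.
        raise TypeError(f"unsupported type: {type(value).__name__}")
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode_canonical(v) for v in value)
    if isinstance(value, dict):
        pairs = []
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError("only string map keys are allowed")
            pairs.append((encode_canonical(key), encode_canonical(item)))
        pairs.sort(key=lambda pair: pair[0])  # bytewise order of encoded keys
        return _head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Two dicts with the same entries in different insertion order produce identical bytes here, which is the property content-addressing depends on.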
3.2 DreamDB CBOR tag
A single CBOR tag is reserved for use when DreamDB byte sequences are embedded in foreign / generic CBOR documents (outside DreamDB's own schema-typed Objects), where readers need the tag to disambiguate DreamDB content from generic strings/integers/byte-strings.
| Tag | Meaning |
|---|---|
| dreamdb.tag | Wraps any DreamDB value (hash, time anchor, modality tag, spatial key). The inner type is determined by inspecting the wrapped value. |
The concrete tag value is fixed in 0009 §3.2.
Inside DreamDB's own schema-typed Objects (Genesis, Manifest, Track Object, Index Pages, etc.) this tag is NOT used — the schema's field name (or array position, for positional encoded entries per §3.1) is authoritative about each value's type. The tag exists only for the foreign-CBOR-embedding case.
If your application doesn't embed DreamDB values in foreign CBOR, you'll never encounter this tag.
4. The Address Shape
DreamDB uses one address grammar that is simultaneously the protocol-level identifier of an Item and the backend object key of the Object containing it. The "address IS the path" — the SDK never translates between two addressing systems.
The grammar has the two-part structure introduced in 0000 §5.3, with the bucketing decomposition from §3.1 of 0001:
<object-address> is what is sent to the backend as a get/list key. <intra-object-locator> is consumed by the SDK locally (after the Object is fetched) to extract the specific Item. The two parts are separated by a # (URL fragment-style) when both appear in a string-form address.
4.1 Why this layout
Order of segments matters:
- Timeline ID first. Every prefix-list query is scoped to one Timeline, so the Timeline ID is the outermost key. Backends can shard storage on this prefix without coordinating with DreamDB.
- Modality second. A query is always against one modality at a time (you do not "search video and audio together"). Modality narrows the address space cheaply.
- Spatiotemporal key third. The variable-shape segment that carries the queryable structure (time, spatial, or both — see §6.3).
- Content hash last. Disambiguates collisions (per 0000 §5.3) and is the only segment derived from the bytes of the Object itself; it is unknown until the Object is built.
This ordering means:
- Spatial query → list-prefix(<timeline-id>/<modality>/<spatial-key>/) returns matching Bucket Objects.
- Time-range query → list-prefix(<timeline-id>/<modality>/<time-bucket>/) returns matching Time-batch / Fragment / Constant Objects.
- Cross-time-and-spatial query (for spatial-indexed tracks with time partitioning) → list-prefix(<timeline-id>/<modality>/<spatial-key>/<time-bucket>/).
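The prefix shapes above can be assembled mechanically. The sketch below joins already-encoded segments with `/` and splits a string-form address at `#`, as described in §4; the function names are hypothetical, and segment encodings (Timeline ID, keys, hashes) are assumed to have been produced elsewhere.

```python
from typing import Optional, Tuple


def object_prefix(timeline_id: str, modality: str, *key_segments: str) -> str:
    """Build a list-prefix string from outermost to innermost segment."""
    parts = (timeline_id, modality, *key_segments)
    for part in parts:
        if not part or "/" in part or "#" in part:
            raise ValueError(f"invalid address segment: {part!r}")
    return "/".join(parts) + "/"


def split_address(address: str) -> Tuple[str, Optional[str]]:
    """Split into (object-address, intra-object-locator or None)."""
    object_address, separator, locator = address.partition("#")
    return object_address, (locator if separator else None)
```

Only the first element returned by `split_address` would be sent to a backend; the locator, when present, stays local to the SDK.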
5. Modality-Tag Grammar (resolves OQ-6)
A modality tag is a structured string that names the type of a Track. It identifies:
- What the payload bytes mean.
- Which Track kind the track is (continuous/event/constant).
- Which Object kind (Fragment / Spatial Bucket / Time-batch / unbucketed) the Track uses.
- For parameterized modalities, the parameter values (e.g. dimensionality for embedding vectors).
5.1 Grammar
- All segments are lowercase ASCII; [a-z0-9_] for body, = allowed only inside param segments.
- Hyphens are forbidden inside segments (so segment boundaries are unambiguously .).
- Total tag length: max 256 bytes. (Earlier drafts of this spec specified 128; bumped to accommodate realistic reverse-DNS prefixes plus multi-table modality parameters.)
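A validator for just the three constraints listed above might look like the following sketch. It is not the full grammar: the shape of a param segment (`name=value`, with both sides drawn from `[a-z0-9_]`) is an assumption, and the function name is hypothetical.

```python
import re

MAX_TAG_BYTES = 256

# ASSUMPTION: a param segment is name=value, both sides from [a-z0-9_].
_SEGMENT = re.compile(r"[a-z0-9_]+(=[a-z0-9_]+)?")


def is_valid_modality_tag(tag: str) -> bool:
    """Check character set, '.'-separated segment shape, and total length."""
    try:
        raw = tag.encode("ascii")
    except UnicodeEncodeError:
        return False
    if not raw or len(raw) > MAX_TAG_BYTES:
        return False
    # Every '.'-separated segment must match in full; empty segments fail.
    return all(_SEGMENT.fullmatch(segment) for segment in tag.split("."))
```

Because hyphens and uppercase letters fall outside the character class, they are rejected without any special-casing.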
5.2 Built-in modality classes (v0)
Built-in classes do not require a reverse-DNS prefix. The set is small and fixed in v0:
| Class | Track kind | Object kind | Example tag |
|---|---|---|---|
| video | continuous | Fragment | video.h264, video.av1, video.hevc |
| audio | continuous | Fragment | audio.opus, audio.aac, audio.flac |
| embedding | continuous | Spatial Bucket OR unbucketed | embedding.f32.dim=768.bucketed, embedding.f32.dim=128 |
| transcript | event | Time-bucketed batch | transcript.turn |
| annotation | event | unbucketed (low-volume) OR Time-bucketed batch | annotation.json |
| scene | event | unbucketed | scene.boundary |
| sensor | event | Time-bucketed batch | sensor.gps, sensor.imu |
| title | constant | unbucketed | title.text |
| author | constant | unbucketed | author.text, author.json |
| license | constant | unbucketed | license.spdx |
| source | constant | unbucketed | source.uri |
| description | constant | unbucketed | description.text |
For embedding, the bucketed parameter (presence flag, no =value) selects Spatial Bucket Objects; absence means unbucketed (one Object per vector — only viable for small tracks).
For event classes, the .bucket=<duration> parameter selects Time-bucketed batch storage with the given bucket duration (e.g. transcript.turn.bucket=10s); absence means unbucketed.
5.3 User-defined modalities
Applications that need a modality not in the built-in table use reverse-DNS namespacing:
The first three segments must form a reverse-DNS path the application controls. This prevents two unrelated applications from accidentally adopting the same modality tag.
Built-in tags are reserved: a user-defined tag MUST NOT use video, audio, embedding, transcript, annotation, scene, sensor, title, author, license, source, or description as its first segment.
5.4 Track kind & Object kind from the tag alone
A reader that sees a modality tag MUST be able to determine the Track kind and Object kind without consulting any external registry. Implementations achieve this by:
- Parsing the first segment against the built-in class table (§5.2), or
- For namespaced tags, requiring that the user-defined class register a Track-kind / Object-kind mapping in the TrackTypeRegistry field of the Manifest's space-config sub-Object (§7).
A Manifest that references a user-defined modality without a corresponding registration is invalid; readers MUST reject it.
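The two-step resolution can be sketched as a table lookup with a registry fallback. The built-in mapping below is copied from the §5.2 table; the registry is modelled as a plain dict keyed by the full tag, which is an assumption, since the actual TrackTypeRegistry layout is not fixed in this section. The function name is hypothetical.

```python
from typing import Optional

# Track kind per built-in class, taken from the table in §5.2.
BUILTIN_TRACK_KIND = {
    "video": "continuous", "audio": "continuous", "embedding": "continuous",
    "transcript": "event", "annotation": "event", "scene": "event",
    "sensor": "event",
    "title": "constant", "author": "constant", "license": "constant",
    "source": "constant", "description": "constant",
}


def track_kind(tag: str, registry: Optional[dict] = None) -> str:
    """Resolve a tag's Track kind from the tag itself or a Manifest registry."""
    first_segment = tag.split(".", 1)[0]
    if first_segment in BUILTIN_TRACK_KIND:
        return BUILTIN_TRACK_KIND[first_segment]
    # ASSUMPTION: registry maps the full user-defined tag to its Track kind.
    if registry is not None and tag in registry:
        return registry[tag]
    # Unregistered user-defined modality: the referencing Manifest is invalid.
    raise ValueError(f"no Track-kind registration for modality {tag!r}")
```

A built-in tag never consults the registry, so a reader with no Manifest at hand can still classify `video.h264` or `license.spdx`.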
6. Address Components
This section pins down each segment of the address.
6.1 Timeline ID
The Timeline ID is the BLAKE3-256 hash of the deterministic-CBOR encoding of the Timeline Genesis Object (per 0001 §5.1), prefixed with the multihash algorithm tag (§2.1). 33 bytes on the wire; 53 base32 characters in string form (§8).
The Timeline ID is globally unique by the cryptographic argument in 0001 §5.2. Two writers never accidentally share a Timeline; they share one only if they explicitly exchange the Genesis Object.
6.2 Modality tag
The modality tag (§5) appears verbatim in the address as an ASCII segment. Length is bounded at 256 bytes.
6.3 Spatiotemporal key
This is the variable-shape segment. The shape depends on the modality's Object kind:
| Object kind | Spatiotemporal key shape | Notes |
|---|---|---|
| Fragment (media) | <time-bucket> | Storage-layout hint (see §6.3.1). Placement: floor(t_start / bucket-duration). |
| Spatial Bucket | <spatial-key> or <spatial-key>/<time-bucket> | Time-bucket included iff the modality declares spatiotemporal partitioning. |
| Time-bucketed batch | <time-bucket> | Storage-layout hint (see §6.3.1). Placement: floor(t_start / bucket-duration). |
| Unbucketed (Item = Object) | <time-anchor> | Exact time anchor; encoding per 0003. This is the query primitive (no separate index). |
| Constant | (empty) | Coverage is "all of time"; modality alone identifies the constant. |
<time-bucket> and <time-anchor> encodings are deferred to 0003. <spatial-key> derivation is deferred to 0004; this document fixes only its encoding format:
The spatial-key bit length is part of the modality parameters: e.g. embedding.f32.dim=768.bucketed.spatial-bits=18 means 18-bit spatial keys → 18-character keys → up to 2¹⁸ ≈ 262K Spatial Bucket Objects.
6.3.2 Why base2 (and not base32) for spatial keys
Spatial keys carry structured bits whose prefix relationships drive list-prefix queries. Any encoding that aligns to a multi-bit character size (base16 → 4 bits/char, base32 → 5 bits/char) requires padding when the bit length is not a multiple of the alignment, and that padding silently destroys prefix preservation:
The trailing zero pad on the parent contaminates char #2 with bits the child doesn't share, breaking list-prefix-based spatial queries silently.
Base2 sidesteps this entirely: the encoding is literally the bit string, so character-prefix and bit-prefix are the same relationship by construction. A 14-bit prefix query of an 18-bit modality truncates the spatial-key string to 14 chars — no rounding, no encoding gymnastics.
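The difference is easy to reproduce. The sketch below encodes a parent bit string and one of its children both ways; the function names and the sample bit patterns are illustrative, and the base32 alphabet is assumed to be lowercase RFC 4648 purely for the demonstration.

```python
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"  # RFC 4648, lowercased


def spatial_key_base2(value: int, bits: int) -> str:
    """Render a spatial key as a fixed-width string of '0'/'1' characters."""
    if bits < 1 or not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the declared bit length")
    return format(value, f"0{bits}b")


def base32_of_bits(bit_string: str) -> str:
    """Zero-pad a bit string to a multiple of 5 bits and base32-encode it."""
    padded = bit_string + "0" * (-len(bit_string) % 5)
    return "".join(
        BASE32_ALPHABET[int(padded[i:i + 5], 2)] for i in range(0, len(padded), 5)
    )


parent = "1011011"     # 7-bit prefix
child = "101101111"    # 9-bit key that extends the parent

# Base2: the character prefix relationship mirrors the bit prefix relationship.
base2_prefix_ok = child.startswith(parent)

# Padded base32: the parent's zero pad lands inside its second character.
base32_prefix_ok = base32_of_bits(child).startswith(base32_of_bits(parent))
```

Here the parent encodes to `wy` and the child to `w6`: the first character agrees, but the second differs because the parent's three pad bits are zeros where the child has real data, so `base32_prefix_ok` is false while `base2_prefix_ok` is true.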
The cost is verbosity: a 20-bit key takes 20 chars instead of 4 base32 chars. Within a DreamDB address that already includes a 53-char Timeline ID and a 53-char content hash (§6.1, §6.4), the marginal length is rounding error. Hashes elsewhere in addresses keep base32 (per §8.1) because they are opaque values where conciseness matters; spatial keys keep base2 because they are structured bit strings where prefix semantics matter. Different roles, different encodings.
6.3.1 The time-bucket is a storage-layout hint, not a query primitive
For Fragment- and Time-batch-bearing Tracks, the <time-bucket> segment in an Object's address is purely a storage-layout hint. Its purpose is to give backends a natural prefix on which to shard a million Objects without re-coordinating with DreamDB. It is not the source of truth for time-range queries.
The source of truth is the object_index field on the Track Object (see §7.3), which records the exact time extent of every Object: [(t_start_i, t_end_i, ..., address_i), ...] for Fragments; [(time_bucket_i, batch_address_i), ...] for Time-batches.
Placement rule. A Fragment or Time-batch Object whose contained Items span the time range [t_start, t_end) is placed at the bucket determined by its start:
A Fragment that crosses a bucket boundary (e.g. covers [59.9s, 60.1s) with 60s buckets) goes into bucket 0. The fact that some of its Items live in clock-time bucket 1 is irrelevant to its placement.
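As a small worked example of the placement rule, the sketch below computes the bucket index from the start time alone. Timestamps are modelled as integer milliseconds, which is an assumption made for illustration (the actual time encoding is deferred to 0003), and the function name is hypothetical.

```python
def time_bucket(t_start: int, bucket_duration: int) -> int:
    """Bucket index = floor(t_start / bucket_duration), in integer time units."""
    if bucket_duration <= 0:
        raise ValueError("bucket duration must be positive")
    # Floor division matches floor() for negative timestamps as well.
    return t_start // bucket_duration
```

With 60 s buckets, a Fragment covering [59.9 s, 60.1 s) gives `time_bucket(59_900, 60_000)`, which is bucket 0; the end time plays no part in the result.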
Time-range query rule. A reader answering a time-range query MUST consult the Track Object's object_index — never list-prefix on the <time-bucket> segment alone. Concretely:
- Read (and cache per session) the Track Object.
- Iterate its object_index, selecting Objects whose [t_start, t_end) overlaps the query range.
- Issue ranged-GETs against the matching Objects' addresses.
A query for t = 60.05s against the example Fragment above succeeds because the object_index records t_start = 59.9, t_end = 60.1, and interval-containment is checked against those exact values, independent of the storage bucket.
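The selection step amounts to a half-open interval overlap test over the index entries. In the sketch below, entries use the Fragment shape `[t_start, t_end, byte_size, fragment_address]` from §7.3.1, times are integer milliseconds (an assumption, pending 0003), and the function name and addresses are placeholders.

```python
def select_overlapping(object_index: list, q_start: int, q_end: int) -> list:
    """Return entries whose [t_start, t_end) overlaps the query [q_start, q_end)."""
    return [
        entry for entry in object_index
        if entry[0] < q_end and q_start < entry[1]
    ]


# Placeholder index: the middle entry is the boundary-crossing Fragment,
# stored in bucket 0 even though part of it lies past the 60 s mark.
example_index = [
    [0, 59_900, 1_000, "addr-0"],
    [59_900, 60_100, 1_000, "addr-1"],
    [60_100, 120_000, 1_000, "addr-2"],
]
hits = select_overlapping(example_index, 60_050, 60_051)
```

The query at 60.05 s selects only the second entry, and nothing in the test refers to bucket numbers.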
list-prefix(<timeline>/<modality>/<time-bucket>/) is appropriate only as a bootstrap discovery mechanism — when an SDK is enumerating Objects from a backend it has never seen and has no Track Object for. Even in that case, the SDK uses the result as a candidate set and relies on the Track Object (once retrieved) for exact selection.
Recommendation (SHOULD). Writers SHOULD ensure max(item-duration) ≤ bucket-duration for the modalities they emit. Violating this does not break correctness — the object_index handles all cases — but it gradually unbalances storage layout (lots of Fragments straddling boundaries cluster in earlier buckets).
This rule does not apply to Spatial Buckets: every vector has exactly one deterministic spatial-key, so there is no boundary-spanning ambiguity. Spatial-key prefix listing remains a primary query primitive for feature queries.
6.4 Content hash
The BLAKE3-256 hash of the Object's bytes (whatever the Object's internal format), prefixed with the multihash algorithm tag. 33 bytes on the wire; 53 base32 chars in string form.
For a Genesis Object, the <content-hash> is the Timeline ID — Genesis Objects address themselves by their own hash.
6.5 Intra-object locator
Present only when the Object is a bucket containing multiple Items. All external locators use a single byte-range form — there is exactly one locator syntax in dreamdb:// URIs, regardless of Object kind:
The locator is appended to the object-address with a # separator:
For unbucketed Items (Item = Object), the locator is empty and the # is omitted.
6.5.1 Why a single byte-range form
A dreamdb:// URI is self-explanatory at the fetch level: any tool that speaks HTTP Range or an equivalent backend primitive can fetch the bytes without knowing the modality, the Object's internal layout, or anything beyond the URI itself. The modality tag (already in the path) tells the SDK how to decode the bytes once fetched; the byte-range locator tells anyone how to fetch them.
This gives:
- Universal portability. An SDK that has never seen the modality can still copy, archive, or proxy the Item. It just can't decode the bytes — which is unavoidable without a decoder.
- Native cacheability. Byte-range URIs map directly to HTTP Range requests, so CDNs and edge proxies cache them transparently. No special-case DreamDB logic needed in intermediate caches.
- Grammar simplicity. One locator syntax, no per-Object-kind dispatch in URI parsers.
6.5.2 Internal logical references vs external URIs
An SDK MAY reason internally in terms of logical references — idx:1247 for a vector inside a Spatial Bucket, (time-anchor, payload-hash) for an event inside a Time-batch — when it has the Track Object cached and is doing local lookups. What it MUST NOT do is externalize those logical references as dreamdb:// URIs. Anything that crosses the boundary out of the SDK (returned to the application, written into a manifest, shared between SDKs, embedded in another document) MUST be the byte-range form.
6.5.3 Computing the byte range at mint time
Converting a logical reference to a byte range requires knowing the Object's layout. 0007 defines two layout patterns that make this conversion possible without fetching the whole Object:
- Fixed-size records (the common case). For modalities like embedding.f32.dim=N, every record is a known fixed size; byte_offset(idx) = header_size + idx × record_size. The SDK computes the range from the modality parameters alone.
- In-Object offset table (fallback for variable-size payloads). The Object begins with a small [(time_anchor_i, byte_offset_i, byte_size_i), ...] table; the SDK fetches just the table (a small ranged GET against the Object's first KB), looks up the record, and emits a URI carrying the resolved bytes: range.
The choice of pattern per modality is fixed in 0007. The Track Object's object_index MAY also carry the per-Item byte ranges inline (avoiding even the small lookup GET) when the modality declares it.
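For the fixed-size case, the conversion is plain arithmetic. The sketch below applies the byte_offset formula and renders the result as an HTTP Range header value; the function names are hypothetical, and `header_size` and `record_size` are taken as inputs because their actual values are defined per modality in 0007.

```python
from typing import Tuple


def record_byte_range(idx: int, header_size: int, record_size: int) -> Tuple[int, int]:
    """Return the half-open byte range [start, end) of record `idx`."""
    if idx < 0 or header_size < 0 or record_size <= 0:
        raise ValueError("idx/header_size must be >= 0 and record_size > 0")
    start = header_size + idx * record_size
    return start, start + record_size


def http_range_header(start: int, end_exclusive: int) -> str:
    """HTTP Range uses inclusive end offsets, so subtract one from the end."""
    return f"bytes={start}-{end_exclusive - 1}"
```

With a 16-byte header and 8-byte records, record 3 occupies bytes 40 through 47, which `http_range_header` renders as `bytes=40-47`.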
7. Per-Entity Address Derivation
This section walks through every protocol entity and shows exactly how its address is computed.
7.1 Genesis Object
A Genesis Object is the seed of a Timeline (0001 §5.1). Its CBOR encoding contains:
Address (= Timeline ID):
A Genesis Object has no Timeline ID prefix — it is identity-bearing rather than identity-anchored. Backends store it at the canonical key genesis/<multihash>.
7.2 Manifest
A Manifest enumerates the state of a Space at one moment (0001 §7). Its CBOR encoding contains:
The parents field is an array of multihashes, supporting linear advance (one parent), root Manifests (empty array), and merges (multiple parents). DAG semantics are pinned in 0008 §2.
7.2.0 SpaceConfig sub-Object
The OPTIONAL space_config field carries Space-wide policy: encryption mode (0019), per-tenant quotas (0018), capability-token issuer keys (0012), etc. It is a CBOR map with the following well-known sub-fields, each independently optional:
Absent space_config ⇒ all sub-fields take their default values (no encryption, no quotas, single-tenant). Sub-fields not present in a given space_config ⇒ that sub-field's default applies (the operator opts into encryption and quotas independently).
Per the map-extensibility rule in §3.1.3, readers MUST ignore unknown sub-fields of space_config rather than rejecting the Manifest. New spec versions add fields without breaking pre-existing readers.
7.2.1 Inline track list (small Spaces)
For Spaces whose track count is modest (the common case — most Spaces have well under 10,000 tracks), the tracks field is an inline list:
Writers MUST switch to the paged form (§7.2.2) when the inline list would exceed 1 MiB of CBOR-encoded bytes. Implementations MAY switch sooner.
7.2.2 Paged track list (large Spaces)
For very large Spaces, the tracks field uses the same B-tree-of-Index-Pages primitive as Track Objects (§7.3.2):
A Manifest Index Page is structurally identical to a Track Index Page (§7.3.2) except that leaves carry track entries (the inline-form record above) and entries are sorted by (timeline_id, modality) rather than by time. Internal pages narrow the search by (timeline_id, modality) ranges.
Manifest Index Pages live at:
Read path for "what tracks does Timeline T have?":
- Read Manifest → get root page address.
- Recurse on internal pages whose (timeline_id_min, timeline_id_max) covers T.
- At leaves, collect track entries with timeline = T.
In practice, paging the Manifest's track list matters only for Spaces with >10K tracks; it is defined here for symmetry with §7.3.2 and to avoid introducing a second mechanism later.
7.2.3 Manifest address
A Manifest is stored at the backend key manifests/<multihash-of-CBOR-bytes>. Manifests are not addressed under any Timeline because they reference many Timelines; they live in their own top-level namespace.
The address of the current manifest of a Space is either:
- The hash itself (hash-addressed Spaces), or
- A ref pointing at the hash (refs/<ref-name> per §10), for ref-supported backends.
7.3 Track
A Track Object enumerates the Items (or Object-keys, when bucketed) belonging to one Track. Its CBOR encoding contains:
For boundary-spanning Object kinds (Fragment, Time-batch), the index records the actual time extent of each Object's contents — which MAY exceed the nominal bucket range when items span a boundary (per §6.3.1). The index is the authoritative source for time-range queries.
7.3.0 Form discrimination (inline vs paged)
The SDK MUST discriminate inline vs. paged form by inspecting the CBOR major type of the object_index field:
- CBOR major type 4 (array) → inline form (§7.3.1). Iterate entries directly.
- CBOR major type 5 (map) with field "form": "paged" → paged form (§7.3.2). Descend into the B-tree of Index Pages.
- Any other shape → MUST be rejected as malformed. Surface to caller as a Manifest validation error (0006 §7).
The same rule applies to the Manifest's tracks field (§7.2): array → inline; map with "form": "paged" → paged; otherwise malformed.
This discrimination is purely structural — readers do not consult any external registry or metadata to know which form they're parsing. The CBOR shape is self-describing.
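After decoding, the structural check reduces to a type test. In this sketch a decoded CBOR array is assumed to arrive as a Python list and a decoded CBOR map as a dict, which is how common CBOR decoders behave; the function name is hypothetical.

```python
def index_form(object_index) -> str:
    """Classify a decoded object_index (or tracks) field as inline or paged."""
    if isinstance(object_index, list):
        return "inline"  # CBOR major type 4 (array)
    if isinstance(object_index, dict) and object_index.get("form") == "paged":
        return "paged"   # CBOR major type 5 (map) with "form": "paged"
    # Any other shape is malformed and must be rejected.
    raise ValueError("malformed index: expected array or paged-form map")
```

A map without the `"form": "paged"` field is rejected just like a scalar would be, so a reader never guesses at the shape.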
7.3.1 Inline form (small tracks)
For tracks whose index fits in a single Object without taxing reads, the index is inline. Each entry is a positional CBOR array (per §3.1's high-volume schema rule), with field order pinned per Track kind below. The same field order is reused by Index Page leaves in the paged form (§7.3.2 / 0007 §7.3), modulo the absolute-vs-delta time encoding (inline form uses absolute u64 anchors; paged-leaf form uses deltas relative to the page's t_min).
Per Track kind:
- Fragment-bearing Tracks (media): [t_start, t_end, byte_size, fragment_address] per entry. t_end is the exclusive upper bound of the Fragment's time coverage; t_end > t_start. Fragments need not be contiguous (gaps allowed) or non-overlapping (layered Fragments may overlap).
- Spatial-Bucket Tracks: [spatial_key, t_start, t_end, byte_size, bucket_address, table_id?] per entry; table_id is present iff the modality declares tables=L > 1 (per 0004 §6.2). Multiple entries with the same spatial_key are allowed to handle hot-spot spatial keys whose Items have been split across multiple bucket Objects per the splitting rule in 0007.
- Time-bucketed batch Tracks (events): [t_start, t_end, time_bucket, batch_address] per entry. t_start and t_end are the actual time extent of the items in the batch.
- Unbucketed Tracks: [item_address] per entry (single-element array; the item's time anchor is encoded directly in item_address).
- Constant Tracks: a single constant_address (no array wrapper, since Constant Tracks have exactly one item).
Forward-compatibility: per the array-length-as-version rule in §3.1.1, future spec revisions MAY append fields after the v0 positions. Readers MUST iterate up to their known field count and ignore trailing fields.
Writers MUST switch to the paged form (§7.3.2) when the inline index would exceed 1 MiB of CBOR-encoded bytes. Implementations MAY switch sooner.
7.3.2 Paged form (large tracks: B-tree of immutable Index Pages)
At 1M+ items per track, an inline index becomes a 70+ MB Object — unacceptable for cold-start latency, write churn under live ingest, and warm-cache memory. The paged form replaces the inline list with a small reference to the root of an immutable, content-addressed B-tree of Index Pages.
Each Index Page is its own content-addressed Object with the address:
(The literal index segment distinguishes Index Pages from Track Objects, which use track, and from Item Objects, which use <spatiotemporal-key>.)
The CBOR encoding of an Index Page:
Read path (find Items overlapping [t_a, t_b]):
- Read the Track Object → get root page address. (1 GET)
- Recurse: at each internal page, descend into children whose [t_min_in_subtree, t_max_in_subtree] overlaps [t_a, t_b]. (~log_fanout(N) GETs total.)
- At leaves, collect matching Item/Object addresses.
For fanout = 256, 1M Fragments → tree height ≈ 3 → ~3 GETs of ~18 KB each ≈ 54 KB total per query (vs. 70 MB inline).
Write path (append a new Item/Object):
- Locate the leaf page that should hold the new entry.
- Create a new leaf page (old contents + new entry, re-sorted).
- Walk up the tree, creating new internal pages whose only change is the updated child pointer.
- The new root page address embeds in a new Track Object; old pages remain — they're immutable and may still be referenced by older Manifest versions.
For fanout = 256 and 1M items, append cost ≈ 3 new Pages × 18 KB ≈ 54 KB written per fragment (vs. 70 MB).
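The read and write estimates above follow from the tree height. The sketch below reproduces that arithmetic with integer math; the function names are hypothetical, the 18 KB page size is the approximate figure used in the text, and the model assumes a fully packed tree with one page touched per level.

```python
def tree_height(n_entries: int, fanout: int) -> int:
    """Smallest height h such that fanout**h >= n_entries (at least 1)."""
    if fanout < 2 or n_entries < 1:
        raise ValueError("need fanout >= 2 and at least one entry")
    height, capacity = 1, fanout
    while capacity < n_entries:
        capacity *= fanout
        height += 1
    return height


def bytes_per_path(n_entries: int, fanout: int, page_bytes: int) -> int:
    """Bytes read per point query, or rewritten per append, along one root-to-leaf path."""
    return tree_height(n_entries, fanout) * page_bytes
```

For 1M entries at fanout 256 the height is 3, because 256² = 65,536 is too small and 256³ ≈ 16.7M is enough, so one path costs about 3 × 18 KB = 54 KB.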
Page size and fanout are pinned in 0007 §7.1: default fanout B = 256, target page size 16 KiB, max page size 64 KiB. A leaf page that exceeds the target size on append is split; an internal page that exceeds it is also split, propagating the split up if necessary. Splits never modify existing pages — they create new pages and update the tree by copy-on-write up to the root.
Spatial-Bucket Tracks special case. The spatial-key already provides log-N navigation via base2 prefix-listing on the backend (§6.3.2), so paged indexing for the spatial dimension is unnecessary. However, for very large spatially-bucketed tracks (millions of buckets), the enumeration of spatial-key → bucket-address mappings still benefits from the paged form. The leaf entries are then keyed by spatial_key (lexicographic order) rather than by time, with the same B-tree machinery.
7.3.3 Track Object address
The literal track segment distinguishes the Track Object from the Item Objects (under <spatiotemporal-key>/...) and from Index Pages (under index/...) within the same <timeline-id>/<modality-tag>/ prefix.
7.4 Item Objects (per kind)
7.4.1 Fragment (Continuous Signal media)
<time-bucket> is floor(t_start / bucket-duration) per §6.3.1 — a storage-layout hint, not a query primitive. A Fragment whose time coverage [t_start, t_end) crosses a bucket boundary is placed by its start; the Track Object's object_index records the exact extent and is consulted for time-range queries.
Writers SHOULD ensure max(fragment-duration) ≤ bucket-duration for balanced storage layout. Internal byte format defined in 0007. Fetched as a whole; sub-frame access is via the byte-range intra-object locator.
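The placement rule can be sketched as follows, assuming timestamps and bucket durations are non-negative integer nanoseconds. The function names are illustrative, and the string encoding of the resulting bucket number is left to 0003.

```python
NS_PER_S = 1_000_000_000


def time_bucket(t_start_ns: int, bucket_duration_ns: int) -> int:
    """floor(t_start / bucket-duration); placement uses the start time only."""
    return t_start_ns // bucket_duration_ns


def fits_bucket(t_start_ns: int, t_end_ns: int, bucket_duration_ns: int) -> bool:
    """The SHOULD-level writer check: fragment duration <= bucket duration."""
    return (t_end_ns - t_start_ns) <= bucket_duration_ns


bucket_ns = 60 * NS_PER_S
# A Fragment covering [59 s, 61 s) crosses the boundary at 60 s but is placed by its start.
print(time_bucket(59 * NS_PER_S, bucket_ns))  # bucket 0
print(time_bucket(60 * NS_PER_S, bucket_ns))  # bucket 1
```

Because placement ignores `t_end`, a time-range query cannot rely on the bucket number alone, which is why the Track Object's object_index records the exact extent.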
7.4.2 Spatial Bucket (Continuous Signal vectors at scale)
For spatiotemporally-partitioned modalities:
Internal format (per 0007): a small header followed by a packed array of vectors with their per-vector time anchors.
7.4.3 Time-bucketed batch (high-volume events)
<time-bucket> is floor(t_start / bucket-duration) per §6.3.1 — a storage-layout hint, not a query primitive. An event whose time anchor is [t_start, t_end) and whose t_end crosses a bucket boundary remains in the batch placed by t_start; the Track Object's object_index records the actual extent of items in the batch.
Writers SHOULD ensure max(event-duration) ≤ bucket-duration for balanced storage layout. Internal format (per 0007): an append-log of (time_anchor, payload) records sorted by time anchor.
7.4.4 Unbucketed Item Object (low-volume events, small embedding tracks)
The Object is the Item; no intra-object locator.
7.4.5 Constant
No spatiotemporal key segment; coverage is implicit.
7.4.6 Chunked Item Manifest (per 0014 Path A)
When a modality declares chunk-size=<bytes>, each logical Item is stored as N content-addressed chunk Objects plus one ItemManifest Object that enumerates them in byte order:
The chunks themselves live at the existing time-bucketed Fragment path (§7.4.1). The Track Object's FragmentEntry for a chunked Item is a 5-tuple with the trailing is_manifest = true (per 0014 §2.4); decoders fetch the ItemManifest, then range-fetch the listed chunks. Modalities without chunk-size produce 4-tuple FragmentEntries and never emit ItemManifests — existing Track Object hashes remain byte-identical.
7.4.7 Rendition Playlist (per 0014 Path B)
When a video modality declares renditions=N, the renditions are enumerated by a RenditionPlaylist Object:
Each rendition has its own Track Object referenced from the playlist's renditions[i].track_object field. The Manifest registry entry for the modality carries playlist: <hash> instead of a single Track reference.
7.4.8 TextIndex (per 0015 §3)
When a text modality declares an inverted-index algorithm (dreamdb.bm25, dreamdb.bm25-plus, dreamdb.splade-cosine), the index lives at:
Posting-list pages (paged Index Pages per 0002 §7.3.2) live at the parallel slot:
The two-segment text-index / posting form disambiguates the root TextIndex Object from its B-tree of posting-list pages.
7.4.9 MultiVectorIndex (per 0015 §4)
ColBERT-style late-interaction modalities (.multi-vec parameter) carry a MultiVectorIndex Object enumerating per-token vectors:
Internally references a SpatialIndex (any spec/0004 / 0013 algorithm) over the corpus's token-vectors plus a parent-doc resolution Track.
7.4.10 HotShard (per 0016 §2)
A HotShard buffers recent appends for a Track without forcing a full Manifest publish per Item:
The Manifest registry's hot_shard field points at the current HotShard. Readers consult both Track + HotShard during query resolution; flush converts HotShard contents into the Track Object via Manifest publish.
7.4.11 GraphPage (per 0013 §4.2)
Graph-indexed modalities (dreamdb.vamana-cosine, dreamdb.fresh-vamana-cosine) store the graph's adjacency-list pages at:
GraphPage Objects are NOT CBOR — they are a custom packed-byte format (per 0013 §4.2: 192-byte header + variable-size node records). Treated like Bucket Objects for storage purposes (raw bytes after a fixed header). The GraphIndex Object's graph_layout.page_node_count field determines which page a node-id lives in.
7.5 Global / cross-Timeline Object Kinds
Several Object Kinds are addressed independently of any single Timeline because they MAY be shared across Timelines (e.g. the same SpatialIndex Object may be referenced by multiple embedding Tracks on multiple Timelines). These live at top-level namespaces:
| Path prefix | ObjectKind | Defined in | Purpose |
|---|---|---|---|
| manifests/ | Manifest | §7.2 | Per-Space manifest DAG nodes |
| spatial-index/ | SpatialIndex | 0004 §3.2 | Partition algorithm params (LSH seed / centroids) |
| scalar-index/ | ScalarIndex | 0011 §4 | B-tree / bitmap structured-filter index params |
| vector-compressor/ | VectorCompressor | 0010 §3.1 | PQ / QINCo codebook + weights |
| graph-index/ | GraphIndex | 0013 §3.1 | Vamana graph metadata + entry point |
| federation-manifests/ | FederationManifest | 0012 §3.1 | Cross-backend manifest-of-manifests |
| tenant-usage/ | TenantUsageBatch | 0018 §4.1 | Per-Space rolling resource-usage statistics |
| refs/ | Ref | §10 | Per-backend mutable Manifest pointer |
| federation-refs/ | FederationRef | 0012 §4.3 | Cross-backend mutable FederationManifest pointer |
| tenant-usage-refs/ | TenantUsageRef | 0018 §4.3 | Per-tenant pointer at latest TenantUsageBatch |
All top-level namespaces share the convention: <prefix>/<multihash-of-canonical-CBOR-bytes>. The prefix is part of the path's parser-level discriminator (per §6.3); two ObjectKinds at different prefixes MAY happen to share a hash (statistically improbable, semantically irrelevant — they live in different namespaces).
8. Encoding: base32, paths, separators
8.1 Hash encoding
A multihash (algorithm tag + 32-byte BLAKE3) is 33 bytes. Encoded for use in addresses as lowercase base32 without padding (RFC 4648 §6 with = stripped). 33 bytes × 8 bits / 5 bits per char = ceil(52.8) = 53 characters.
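The encoding can be sketched with Python's standard `base64` module. The helper names are illustrative, and the 32-byte digest is passed in as an argument because BLAKE3 is not part of the Python standard library; the tag value `0x1e` is the one given for BLAKE3-256 in §13.

```python
import base64

BLAKE3_256_TAG = 0x1E  # multihash algorithm tag for BLAKE3-256 (per OQ-12)


def encode_multihash(digest: bytes) -> str:
    """Tag byte + 32-byte digest -> 53-char lowercase base32, padding stripped."""
    if len(digest) != 32:
        raise ValueError("expected a 32-byte digest")
    raw = bytes([BLAKE3_256_TAG]) + digest
    return base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def decode_multihash(text: str) -> bytes:
    """Inverse of encode_multihash: restore the padding and upper-case before decoding."""
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


example = encode_multihash(bytes(range(32)))  # placeholder digest, not a real hash
print(len(example))
```

`base64.b32encode` emits 56 characters for 33 input bytes, three of which are `=` padding; stripping them leaves the 53 characters computed above.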
Lowercase base32 is chosen over base64url for:
- Case-insensitivity. Some object stores normalize case in keys; some filesystems are case-insensitive. Using one case eliminates the issue.
- DNS-safety. Although DreamDB addresses do not appear in DNS, the same alphabet is friendly to human transcription and many ad-hoc URL handlers.
- No + or / characters, which conflict with path separators or require URL-escaping.
8.2 Path separator
The / character separates address segments in both the URI form (§9) and the backend object-key form. Backends that use a different native separator (e.g. some flat key-value stores) MUST either accept / literally or be wrapped by an adapter that translates.
8.3 Intra-object separator
The # character separates the object-address from the intra-object locator. This mirrors the URL fragment convention and ensures that backends (which never see anything past #) and SDKs both handle the boundary unambiguously.
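A client-side split might look like the sketch below. The function names are illustrative, and the end-exclusive reading of the `bytes:` locator is an assumption inferred from the §11.2 example, where the range `[3_840_920, 3_844_000)` is written as `#bytes:3840920-3844000`. HTTP `Range` headers use an inclusive end, hence the `- 1`.

```python
from typing import Optional, Tuple


def split_address(address: str) -> Tuple[str, Optional[str]]:
    """Split at the first '#': (object-address, intra-object locator or None)."""
    obj, sep, locator = address.partition("#")
    return obj, (locator if sep else None)


def to_http_range(locator: str) -> str:
    """Convert 'bytes:<start>-<end>' (end exclusive, assumed) to an HTTP Range value."""
    kind, _, span = locator.partition(":")
    if kind != "bytes":
        raise ValueError(f"unsupported locator kind: {kind!r}")
    start_s, _, end_s = span.partition("-")
    start, end = int(start_s), int(end_s)
    if not 0 <= start < end:
        raise ValueError("empty or negative byte range")
    return f"bytes={start}-{end - 1}"


obj, loc = split_address("timeline/modality/key/hash#bytes:3840920-3844000")
print(obj, to_http_range(loc))
```

Only `obj` is sent to the backend as the object key; the locator is consumed by the SDK when it builds the ranged request.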
9. URI Scheme: dreamdb://
For sharing addresses outside a single backend (e.g. in documentation, manifests of manifests, or chat messages):
Where <backend-hint> is an optional hint to the SDK about where to find the bytes — restricted to a single HTTP authority component (host[:port]) and MUST NOT contain /. Backend-specific path components like S3 bucket names are configured via the Connector or expressed via virtual-hosted style addressing (e.g. my-bucket.s3.example.com), NOT embedded in the URI. The hint is purely informational; the address itself is the source of truth, and the SDK MAY consult any backend it knows about.
Examples:
The empty backend-hint form (dreamdb:///...) means "use the default / current backend."
10. Refs Namespace
Refs (per 0000 §5.2) are mutable named pointers maintained by ref-conformant backends only. Their backend keys live in a separate top-level namespace:
<ref-name> is a lowercase ASCII path with /-separated segments. Each segment is [a-z0-9_-], length 1–64. Total ref-name length is bounded at 256 bytes.
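These constraints translate directly into a validator. The sketch below assumes `<ref-name>` is the part after the `refs/` prefix (so `refs/release/v1` has ref-name `release/v1`); the function name is illustrative.

```python
import re

_SEGMENT = re.compile(r"[a-z0-9_-]{1,64}")
MAX_REF_NAME_BYTES = 256


def is_valid_ref_name(name: str) -> bool:
    """Check every '/'-separated segment and the total byte length."""
    if len(name.encode("utf-8")) > MAX_REF_NAME_BYTES:
        return False
    return all(_SEGMENT.fullmatch(seg) for seg in name.split("/"))


for candidate in ["main", "release/v1", "users/alice/scratch", "Main", "a//b"]:
    print(candidate, is_valid_ref_name(candidate))
```

Splitting on `/` makes empty segments (from leading, trailing, or doubled slashes) fail the per-segment check without a separate rule.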
Examples: refs/main, refs/release/v1, refs/users/alice/scratch.
A ref's value is the multihash of the Manifest it points to (33 bytes, base32-encoded for human display, raw bytes for backend storage). Ref updates use conditional writes (CAS / If-Match) per 0000 §5.2.
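The conditional-write contract can be illustrated with an in-memory stand-in. The class `InMemoryRefStore` and its method names are hypothetical; a real backend would use its own primitive (for example an `If-Match` precondition) in place of the lock.

```python
import threading
from typing import Dict, Optional


class InMemoryRefStore:
    """Toy ref store: a ref advances only if the caller saw its current value."""

    def __init__(self) -> None:
        self._refs: Dict[str, bytes] = {}
        self._lock = threading.Lock()

    def read(self, name: str) -> Optional[bytes]:
        return self._refs.get(name)

    def compare_and_swap(self, name: str, expected: Optional[bytes], new: bytes) -> bool:
        with self._lock:
            if self._refs.get(name) != expected:
                return False  # another writer advanced the ref first
            self._refs[name] = new
            return True


refs = InMemoryRefStore()
m1 = bytes([0x1E]) + bytes(32)       # placeholder 33-byte multihash values
m2 = bytes([0x1E]) + bytes([1] * 32)
created = refs.compare_and_swap("main", None, m1)  # create: expects no prior value
stale = refs.compare_and_swap("main", None, m2)    # loses: the ref already exists
advanced = refs.compare_and_swap("main", m1, m2)   # wins: expected value matches
```

A writer whose swap fails is expected to re-read the ref, rebase its new Manifest on the current one, and retry.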
Cross-backend federated deployments additionally use the federation-refs/ namespace (per 0012 §4.3) for the multi-backend analog. Its CAS semantics are identical to refs/, but the authoritative copy lives at exactly one of the federation's participant backends — mirror backends pull from it rather than racing to advance it (cross-backend CAS is intractable at WAN scale, per 0012 §10).
11. Worked Examples
11.1 A Constant Track
A DreamDB Space records the title "FA Cup Final, 2nd half" for a single timeline.
- Timeline Genesis: { origin: 2026-05-06T09:00:00Z (ns), resolution: 1ns, horizon: [0,600s), nonce: 0xa3b9..., canonical_name: "match-2026-05-06" }
  - CBOR-encoded → 89 bytes
  - BLAKE3-256 → T_ID = 0x1e || <32 bytes> (33 bytes raw, 53 base32 chars)
- The Constant Object is the UTF-8 string "FA Cup Final, 2nd half" (22 bytes)
  - BLAKE3-256 → C_HASH = 0x1e || <32 bytes>
- Address: <T_ID> / title.text / <C_HASH>
  - In string form: xy7g...vqra/title.text/q9ng...uudk (~120 chars)
A query "what is the title?" computes T_ID (one CBOR-and-hash from the Genesis), then issues list-prefix(<T_ID>/title.text/). One backend list, one backend get of the matching key.
11.2 A Spatial Bucket
A 1B-vector embedding track on the same timeline. Modality: embedding.f32.dim=768.bucketed.spatial-bits=18.
- A vector at t = 152.481 s with payload <3 KB> is hashed by the §6.3 spatial-key derivation (per 0004) to the 18-bit key 101100110011010101 (base2, 18 chars).
- The vector lands in the bucket whose spatial-key prefix is 101100110011010101. The Bucket Object contains ~3,000 vectors that share this prefix, packed as fixed-size 3080-byte records (8-byte time anchor + 3072-byte f32 vector) after a 160-byte header (per 0007 §6.1).
- Bucket Object content hash: B_HASH.
- Address of the Bucket Object: <T_ID> / embedding.f32.dim=768.bucketed.spatial-bits=18 / 101100110011010101 / <B_HASH>
- The vector's byte offset inside the Bucket, at record index 1247: 160 + 1247 × 3080 = 3,840,920. The vector's byte range: [3_840_920, 3_844_000).
- Address of the individual vector (Item address) externalized as a URI: <T_ID> / embedding.f32.dim=768.bucketed.spatial-bits=18 / 101100110011010101 / <B_HASH>#bytes:3840920-3844000.
A query "find vectors near this one" hashes the query vector to its 18-bit spatial key, issues 1-4 list-prefix(<T_ID>/<modality>/<spatial-key>...) calls (the SDK may truncate to a shorter prefix to widen recall — e.g. truncate to 15 chars for a 15-bit prefix that matches 8× more buckets), fetches the resulting Bucket Objects, and runs exact KNN locally. When returning a result, it emits the byte-range URI shown above; another SDK can fetch that vector's bytes via a single ranged GET, no DreamDB-specific layout knowledge needed.
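The offset arithmetic and the prefix-widening step in this example can be reproduced as below. The constants come from the worked example (160-byte header, 8-byte time anchor, 768 f32 components); the function names are illustrative, and record index 1247 is the index implied by the example's offset formula.

```python
HEADER_BYTES = 160
RECORD_BYTES = 8 + 768 * 4  # time anchor + f32 vector = 3080 bytes


def record_byte_range(index: int) -> tuple:
    """Half-open byte range [start, end) of the record at `index` in a Bucket."""
    start = HEADER_BYTES + index * RECORD_BYTES
    return start, start + RECORD_BYTES


def widen_prefix(spatial_key: str, bits: int) -> tuple:
    """Truncate a base-2 spatial key; return (prefix, number of full-length keys it covers)."""
    prefix = spatial_key[:bits]
    return prefix, 2 ** (len(spatial_key) - len(prefix))


start, end = record_byte_range(1247)
print(f"#bytes:{start}-{end}")
print(widen_prefix("101100110011010101", 15))
```

Dropping three bits from the 18-bit key leaves a 15-character prefix that covers 2^3 = 8 sibling buckets, which is the recall-widening trade-off described above.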
11.3 A Fragment
A video Fragment for the same timeline. Modality: video.h264. The Fragment covers t = [60s, 62s) and is 1.4 MB.
- Time bucket: floor(60_000_000_000 / 60_000_000_000) = 1 (assuming 60-second time-buckets, encoding per 0003).
- Fragment content hash: F_HASH.
- Address: <T_ID> / video.h264 / 1 / <F_HASH>
- Address of frame at t = 61.083 s: same as above plus #bytes:412800-415744 (computed from the fragment-index on the Track Object).
Playback: the SDK consults the Track Object's object_index (cached), maps t = 61.083 s to the Fragment's address + byte range, issues one ranged-GET, and feeds the bytes to the decoder.
12. Out of Scope for this Document
- Time-key encoding: <time-anchor> and <time-bucket> byte/string formats (0003).
- Spatial-key derivation: how vectors become bit-strings (0004).
- Per-modality Object internal byte format: Fragment containers, Bucket packed-array layout, Time-batch log format (0007).
- Conformance test vectors: round-trip CBOR encodings, address-derivation test cases (0009).
13. Open Questions Surfaced by This Document
- OQ-11 (→ 0009 §3.2): Concrete CBOR tag numbers. Resolved: single tag dreamdb.tag = 65521 (private-use range), used only in foreign-CBOR contexts.
- OQ-12 (→ 0009 §3.1): Multihash algorithm tag values. Resolved: 0x1e for BLAKE3-256 (IPFS-aligned).
- OQ-13 (→ 0007 §6.5): Spatial+time segment order for partitioned modalities. Resolved: spatial-first, i.e. <timeline>/<modality>/<spatial-key>/<time-bucket>/<hash>.
- OQ-14 (→ 0007 §7.1): Default Index Page fanout B and target page size. Resolved: B = 256, target page size 16 KiB, max page size 64 KiB.
- OQ-15 (→ 0007 §7.2): Inline-vs-paged switch threshold. Resolved: 1 MiB of CBOR-encoded inline bytes; implementations MAY switch sooner.
- OQ-16 (→ 0007 §6): Per-modality Object layout pattern. Resolved: fixed-size records (default for embedding.f32.dim=N); in-Object offset table (fallback for variable-size payloads, used by Time-batches per §8); reference-mode for multi-table Spatial Buckets (per §6.2).
Next: 0003-time-encoding.md — fixes the byte/string format for <time-anchor> and <time-bucket>, and resolves OQ-1 (absolute vs. Genesis-relative origin).