DreamDB v0.2.0 bec026

DreamDB Specification — 0006: Protocol Operations

Status: Draft. Builds on 0000-overview.md, 0001-data-model.md, 0002-content-addressing.md, 0003-time-encoding.md, 0004-spatial-indexing.md, and 0005-backend-interface.md. This document defines the verbs of the DreamDB Protocol — what the SDK does, end to end. It composes the HTTP primitives of 0005 into multi-step operations, formalizes the per-session cache discipline, and resolves OQ-27 and OQ-28.


1. Purpose

0001–0005 defined the state of a DreamDB Space: what entities exist, how they're encoded, where they live, and what HTTP semantics carry them. This document defines the verbs that operate on that state: the concrete sequences of HTTP requests an SDK generates to ingest, layer, query, and stream DreamDB data.

By the end of this document, the following are concrete:

  • The verb taxonomy — the small, well-defined set of operations a conformant SDK exposes to the application.
  • For each verb: inputs, outputs, and the canonical HTTP request sequence that implements it.
  • The per-session cache discipline — what the SDK MUST cache, MAY cache, and MUST NOT cache, with invalidation rules.
  • The concurrency model — what verbs can run in parallel, what guarantees writers and readers have under contention.
  • Failure semantics — what happens when a multi-step verb fails partway through, how recovery works, what the backend sees.

What this document does not define:

  • The exact application API surface. SDKs in different languages will expose slightly different signatures (Rust async vs. Python sync, etc.). The verb semantics are normative; the surface is implementation-defined.
  • The byte format inside Bucket Objects, Fragments, and Time-batches — 0007.
  • Manifest history walking, branching, and merge semantics — 0008.
  • Conformance test vectors — 0009.

2. The Verb Taxonomy

The DreamDB Protocol exposes eight verbs in v0:

| Verb    | Direction | Primary purpose                                        |
|---------|-----------|--------------------------------------------------------|
| Open    | Read      | Bind to a Space; resolve a Manifest hash or Ref name   |
| Resolve | Read      | Fetch and validate a Manifest by hash; populate cache  |
| Query   | Read      | Find Items by time, by feature, or both                |
| Stream  | Read      | Fetch a contiguous time range of a media Track         |
| Get     | Read      | Fetch a single Item by full DreamDB address            |
| Append  | Write     | Add Items to a Track; produce a new Track Object       |
| Layer   | Write     | Publish a new derived Track over a parent Track        |
| Publish | Write     | Commit a new Manifest atomically; optionally advance Ref |

Verbs compose: a typical write sequence is Append → Append → Layer → Publish. A typical read is Open → Query → Stream.

2.1 Verbs that are NOT in the taxonomy

These operations exist in lower-level docs but are not first-class verbs:

  • LIST — only used by Open for cold-start bootstrap (0005 §5.3.1 Manifest Supremacy). Not a user-facing verb.
  • Update, Modify, Patch, Delete — content layer is immutable; corrections are new layers.
  • Transaction, Lock, Reserve — DreamDB is lock-free; concurrency is handled by content-addressing and ref CAS.
  • Subscribe, Watch, Stream-changes — out of scope for v0; future spec MAY add a pull-style notification verb.

A conformant SDK exposes the eight listed verbs. Additional convenience methods (e.g., a KNN shortcut wrapping Query with feature input) are implementation-defined and MUST be implementable in terms of the eight.

2.2 Phase-3 / Phase-4 verb additions

The following verbs are defined in post-v0 spec drafts and are OPTIONAL for v0 SDKs (REQUIRED for SDKs claiming Phase-3 / Phase-4 conformance):

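| Verb     | Direction | Primary purpose                                                  | Defined in |
|----------|-----------|------------------------------------------------------------------|------------|
| federate | Write     | Copy a Manifest's transitive closure between backends            | 0012 §4    |
| Reencode | Write     | Bulk re-index Items from a source modality to a target modality  | 0017 §3    |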

federate has two modes (push / pull) and capability-token auth (0012 §5). SDKs implementing it MUST implement the hash-verify-before-store discipline (0012 §5.1) or be considered non-conformant — the verb's safety story rests entirely on that check.

Reencode is resumable and idempotent — checkpoints published as Layer Manifests every batch_size Items. Crash-and-resume produces the same final state as uninterrupted execution. SDKs implementing it MUST honor the resume_from parameter and verify source-modality non-mutation before continuing.

2.2.1 Query extension: HybridQuery (per 0015 §5)

The Query verb (§4.3) is extended rather than replaced — its query_spec parameter is the structured CBOR sub-Object defined in spec/0015 §5.1, supporting per-modality sub-queries with fusion policy (RRF / linear / max / pareto). A pre-Phase-4 Query invocation with a single dense sub-query is the v0 path unchanged; multi-modality sub-queries activate the spec/0015 query planner.

Resolution of 0015 OQ-66: HybridQuery is not a new verb — it is a richer query_spec payload for the existing Query verb. The eight-verb taxonomy of §2 is unchanged.

A Phase-3 v0.X release SHOULD support federate; a Phase-4 release SHOULD support both Reencode and the extended Query payload shape. Pre-Phase implementations remain conformant within their phase bounds.

3. The Per-Session Cache

DreamDB's hot-path performance depends on aggressive client-side caching. This section pins down what the SDK caches, when it invalidates, and what guarantees the cache provides.

3.1 What is cached

| Cache slot          | Lifetime                        | Invalidation trigger                                        |
|---------------------|---------------------------------|-------------------------------------------------------------|
| Genesis Object      | Per Timeline, until session end | Never (Genesis is immutable; Timeline ID is its hash)       |
| Manifest            | Per (Space, Manifest hash) pair | Never directly; replaced when SDK Resolves a different hash |
| Track Object        | Per Manifest                    | Invalidated when reading from a different Manifest          |
| Index Page          | Per Track Object                | Invalidated with parent Track Object                        |
| SpatialIndex Object | Per (modality, hash) pair       | Never (content-addressed; immutable)                        |
| Hyperplane table    | Per SpatialIndex Object         | Derived; invalidated with SpatialIndex Object               |

3.2 Cache identity = content hash

Every cached Object is keyed by its content hash, not by its path or by any "freshness" notion. Two cache entries for the same hash are by definition identical bytes — content-addressing makes consistency trivial.

The SDK MUST NOT cache Objects keyed by anything other than content hash. In particular: caching a Track Object by (timeline, modality) would be a bug, because two different Manifests can have different Track Objects for the same (timeline, modality) pair (e.g., layered corrections).

3.3 What is NOT cached

  • list-prefix results. Per 0005 §5.3.1 Manifest Supremacy, list-prefix is bootstrap-only. Caching list-prefix results across queries would (a) re-introduce the eventual-consistency window the doctrine eliminates, (b) yield no benefit on the steady-state hot path that doesn't issue list-prefix anyway.
  • Ref → Manifest mappings, beyond the most recent fetch. Refs are mutable. The SDK MAY cache the latest fetch but MUST NOT treat a stale ref value as authoritative. Refs are re-fetched on Open invocations; a long-lived session with a fresh Resolve after work-loss is the right pattern.
  • Bucket Object bytes themselves, beyond what's needed for in-flight queries. A bucket fetched for one query MAY be retained briefly (an LRU window of seconds) but is not part of the protocol-level cache discipline. Aggressive vector caching is an application concern, not a DreamDB concern.

3.4 Cache scope

The cache is per-session, not shared across processes by default. A long-lived application process accumulates a cache that grows with the diversity of Manifests it has consulted; restarting the process clears the cache. SDKs MAY offer a persistent on-disk cache as an extension; this is implementation-defined.

3.5 Cache validation

Because every cached Object is content-addressed, freshness validation is never required: a cached Object cannot go stale. The bytes either match the hash (cache hit, served directly) or they don't (cache corruption; the SDK MUST treat this as a bug, evict, and re-fetch). There is no "stale cache" failure mode.
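A minimal sketch of the hit / corruption / re-fetch rule above. ContentCache, content_hash, and the fetch callback are hypothetical names, not SDK API. The Python standard library has no BLAKE3, so BLAKE2b stands in for the hash function here; a real SDK would use the hash mandated by 0002. Re-hashing on every hit is done only to make the corruption path visible; an implementation might verify less often.

```python
import hashlib
from typing import Callable, Dict


def content_hash(data: bytes) -> str:
    # Stand-in only: the spec uses BLAKE3, which is not in the stdlib.
    return hashlib.blake2b(data, digest_size=32).hexdigest()


class ContentCache:
    """Cache keyed only by content hash (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                    # fetches bytes for a hash from the backend
        self._entries: Dict[str, bytes] = {}

    def get(self, obj_hash: str) -> bytes:
        data = self._entries.get(obj_hash)
        if data is not None and content_hash(data) == obj_hash:
            return data                        # hit: bytes match their key
        # Miss, or a corrupted entry: evict (if present) and re-fetch.
        self._entries.pop(obj_hash, None)
        data = self._fetch(obj_hash)
        if content_hash(data) != obj_hash:
            raise ValueError(f"fetched bytes do not hash to {obj_hash}")
        self._entries[obj_hash] = data
        return data
```

Note that the key never includes a path, Timeline, or modality, which is what rules out the (timeline, modality) keying bug described in §3.2.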

3.6 Cache eviction (memory bounds)

A long-lived SDK process with no eviction policy can accumulate unbounded cache as the application queries diverse Tracks across sessions. At billion-scale across many Tracks the cache could OOM the process: 1000 cached Track Objects × ~5 MB each ≈ 5 GB, plus Index Pages, plus SpatialIndex hyperplane tables.

Conformant SDKs SHOULD provide a bounded cache with a default size cap (recommended: 1 GiB total, configurable). When the cap is exceeded, evict using a standard policy:

  • LRU — least-recently-used eviction is the default recommendation.
  • Eviction is always safe because cache entries are content-addressed: a re-fetch produces bit-identical bytes against the same hash.
  • SDKs MAY use more sophisticated policies (TinyLFU, ARC, segment-LRU) but the default policy MUST be deterministic and bounded.

What MUST stay cached for the duration of an active query:

  • The Manifest the query is bound to.
  • The Track Object's root Index Page (if paged form).
  • Any Index Page currently being traversed.

These are protected from eviction by being held in the query's local stack/working set — outside the cache's eviction policy until the query completes.

The default 1 GiB is a starting point; SDKs running on memory-constrained devices (mobile, edge, embedded) SHOULD lower the default; high-throughput servers MAY raise it. Implementations MUST surface the configured cap to operators (logs, metrics, configuration files).
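One way the bounded cache could look, as a sketch under stated assumptions: BoundedLruCache and its pin/unpin methods are hypothetical, sizes are counted as payload bytes only, and pinning models the "held in the query's working set" protection described above rather than any normative mechanism.

```python
from collections import OrderedDict
from typing import Optional, Set


class BoundedLruCache:
    """LRU cache with a byte cap; pinned entries are skipped by eviction."""

    def __init__(self, cap_bytes: int = 1 << 30) -> None:   # 1 GiB default
        self.cap_bytes = cap_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self._pinned: Set[str] = set()

    def get(self, obj_hash: str) -> Optional[bytes]:
        data = self._entries.get(obj_hash)
        if data is not None:
            self._entries.move_to_end(obj_hash)   # mark as most recently used
        return data

    def put(self, obj_hash: str, data: bytes) -> None:
        if obj_hash in self._entries:             # same hash means same bytes
            self._entries.move_to_end(obj_hash)
            return
        self._entries[obj_hash] = data
        self.used_bytes += len(data)
        self._evict()

    def pin(self, obj_hash: str) -> None:
        self._pinned.add(obj_hash)

    def unpin(self, obj_hash: str) -> None:
        self._pinned.discard(obj_hash)
        self._evict()

    def _evict(self) -> None:
        for key in list(self._entries):           # oldest entries first
            if self.used_bytes <= self.cap_bytes:
                break
            if key in self._pinned:
                continue                          # protected by an active query
            self.used_bytes -= len(self._entries.pop(key))
```

Because eviction is always safe for content-addressed entries, nothing here needs write-back or invalidation hooks; an evicted entry is simply re-fetched by hash.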

4. Read Verbs

4.1 Open(space_ref) → Session

Bind to a Space. space_ref is one of:

  • A Manifest content hash: dreamdb://<backend>/<manifest-hash>.
  • A Ref name (Ref-Conformant backends only): dreamdb://<backend>/refs/<ref-name>.

HTTP sequence:

GET <backend>/manifests/<manifest-hash>           -- if hash form
   OR
GET <backend>/refs/<ref-name>                     -- resolve ref → hash
GET <backend>/manifests/<resolved-hash>           -- then fetch manifest

Returns a Session handle that wraps:

  • The current Manifest (cached).
  • The backend URL and auth context.
  • A handle to the cache (§3).

The Session is opaque to the application but stable for the duration of subsequent verb calls.
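As an illustration of the two space_ref forms, the sketch below turns a space_ref into the request sequence shown above. plan_open is a hypothetical helper; it assumes the backend is a single authority segment with no embedded slashes, which the spec does not state.

```python
from typing import List


def plan_open(space_ref: str) -> List[str]:
    """Return the GET sequence for a space_ref (hash form or Ref form)."""
    prefix = "dreamdb://"
    if not space_ref.startswith(prefix):
        raise ValueError("space_ref must start with dreamdb://")
    # Assumption: the backend is the first path segment after the scheme.
    backend, sep, rest = space_ref[len(prefix):].partition("/")
    if not backend or not sep or not rest:
        raise ValueError("space_ref must name a backend and a hash or Ref")
    if rest.startswith("refs/"):
        ref_name = rest[len("refs/"):]
        return [
            f"GET {backend}/refs/{ref_name}",             # resolve ref to hash
            f"GET {backend}/manifests/<resolved-hash>",   # then fetch Manifest
        ]
    return [f"GET {backend}/manifests/{rest}"]            # hash form: one GET
```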

Cold-start bootstrap (no Manifest or Ref known): the application provides a <backend> URL and the SDK issues LIST manifests/ to discover available Manifest hashes. This is the only sanctioned LIST invocation in the read path. The application then chooses which Manifest to bind to (typically the most recent by ts field).

4.1.1 Ref freshness for long-lived sessions

A long-lived Session (a streaming server, desktop application, persistent worker, etc.) bound via Ref MAY drift out of date as another participant Publishes a new Manifest and advances the Ref. v0 is pull-only — there is no push notification — so Sessions interested in seeing new Publishes MUST poll. Two patterns:

Periodic polling (recommended for sessions with an idle steady-state):

HEAD <backend>/refs/<ref-name>      -- ~1 KB total wire traffic
                                     -- returns ETag without body

The SDK MAY issue a HEAD on the Session's Ref every 30 seconds (default; configurable). If the returned ETag differs from the cached one, the Ref has advanced; the SDK then issues GET refs/<ref-name> to retrieve the new Manifest hash and Resolves it. Caches built against the old Manifest remain valid for any Object hash that still appears in the new Manifest (content-addressing makes cross-Manifest reuse free).

Immediate-before-critical-read polling (recommended for latency-sensitive reads):

If the application requires an absolutely-fresh view for a particular Query, the SDK SHOULD HEAD the Ref before the read if the cached Ref ETag is older than a small threshold (default 5 seconds). This costs one round trip, usually <50 ms, and avoids correctness bugs caused by reading stale data in low-latency loops.

Scope note: Ref freshness polling is unnecessary for short-lived Sessions (e.g., serverless function invocations that Open once and exit). Those are naturally fresh — there is no long-lived cache to drift. Polling is for sustained processes that hold a Session across many queries.

This polling pattern is the v0 substitute for Subscribe/Watch (out of scope, OQ-34). It composes cleanly with the existing Open/Resolve discipline.
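The two polling patterns reduce to one age check with different thresholds. The sketch below assumes a hypothetical RefState record and uses the defaults quoted above (30 s periodic, 5 s before a critical read); timestamps are plain seconds supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class RefState:
    etag: str           # ETag from the most recent Ref response
    checked_at: float   # time (seconds) at which that ETag was confirmed


def should_head(state: RefState, now: float, critical_read: bool = False,
                poll_interval: float = 30.0,
                critical_threshold: float = 5.0) -> bool:
    """True if the cached ETag is older than the applicable threshold."""
    limit = critical_threshold if critical_read else poll_interval
    return (now - state.checked_at) > limit


def on_head_response(state: RefState, etag: str, now: float) -> bool:
    """Record a HEAD result; True means the Ref advanced (GET, then Resolve)."""
    advanced = etag != state.etag
    state.etag = etag
    state.checked_at = now
    return advanced
```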

4.2 Resolve(session, manifest_hash) → Manifest

Fetch a Manifest by hash, validate, cache.

HTTP sequence:

GET <backend>/manifests/<manifest-hash>           -- 1 GET (or cached)

Validation:

  • Decode CBOR.
  • Verify each <track-entry> references a known modality (built-in or in the Manifest's registry).
  • Verify each modality's required spatial_index references (for spatially-bucketed) point at hashes the SDK can fetch.
  • The fetched bytes' BLAKE3 MUST equal manifest_hash. Mismatch is a backend error (5xx category from the SDK's perspective).

4.3 Query(session, track_selector, query_spec) → ResultSet

The primary read verb. Three modes determined by query_spec:

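| Mode                | Inputs                          | Use case                                      |
|---------------------|---------------------------------|-----------------------------------------------|
| query_spec.time     | Time range [t_start, t_end)     | "What events happened between T1 and T2?"     |
| query_spec.feature  | Query vector + recall target ρ  | "Find vectors near this query vector"         |
| Both time + feature | Time range AND query vector     | "Find similar vectors recorded in last hour"  |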

Hot-path HTTP sequence (Track Object cached):

-- All steps below are local, except that step 4 (and optional step 3) issues parallel HTTP/2 GETs.

1. Look up track_selector → Track Object (cached).
2. Resolve query_spec → list of bucket addresses:
   - Time-only:    use object_index time-range overlap.
   - Feature-only: hash query vector → spatial-key prefix(es) → matching bucket addresses
                   from object_index. Multi-table: do this for each of L tables.
   - Combined:     intersect time and spatial-key constraints in object_index.
3. (Optional, application-defined) speculative preloading: issue GETs for the most likely
   buckets early; cancel after refinement. Implementation-defined.
4. Issue parallel ranged GETs (HTTP/2 multiplexed) for matching bucket addresses.
5. Decode each Bucket / Fragment / Batch (per 0007); accumulate candidates.
6. (Feature mode) exact-compare against query vector; rank top-k.
7. Return ResultSet — list of `(Item address, optional score, optional time anchor)`.

Cold-start sequence (Track Object not yet cached):

1. From the cached Manifest, find the Track Object's address.
2. GET the Track Object (1 small fetch; for paged-form, root Index Page + relevant subtree).
3. Cache the Track Object and its Index Pages.
4. Proceed as steps 2-7 of the hot path.

The cold-start cost is paid once per Track per session; subsequent queries on the same Track are full hot-path.
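Two of the local steps can be sketched compactly: the time-range overlap test of step 2 and the exact-compare ranking of step 6. The index entry shape and the use of cosine similarity are assumptions made for illustration; the spec does not fix the comparison metric in this section.

```python
import heapq
import math
from typing import List, Sequence, Tuple

# Assumed shape of an object_index entry: (bucket address, t_start, t_end),
# with the time range half-open like the query range [t_start, t_end).
IndexEntry = Tuple[str, int, int]


def buckets_overlapping(index: Sequence[IndexEntry],
                        t_start: int, t_end: int) -> List[str]:
    """Step 2, time-only: buckets whose range overlaps [t_start, t_end)."""
    return [addr for addr, b0, b1 in index if b0 < t_end and t_start < b1]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_top_k(query: Sequence[float],
               candidates: Sequence[Tuple[str, Sequence[float]]],
               k: int) -> List[Tuple[str, float]]:
    """Step 6: exact-compare every candidate, keep the k best by score."""
    scored = ((cosine(query, vec), addr) for addr, vec in candidates)
    return [(addr, score) for score, addr in heapq.nlargest(k, scored)]
```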

4.4 Stream(session, track_selector, time_range) → ByteStream

Fetch a contiguous time range of a media Track as a streaming-decoder-ready byte stream.

HTTP sequence (hot path):

1. Track Object (cached) → fragment-index (per 0002 §7.3).
2. Walk the fragment-index for Fragments overlapping time_range.
3. For each matching Fragment, compute byte-range-of-interest:
   - First Fragment: bytes from time_range.start within the Fragment.
   - Middle Fragments: full Fragment.
   - Last Fragment: bytes through time_range.end within the Fragment.
4. Issue parallel ranged GETs (HTTP/2) — one per Fragment; possibly with byte sub-ranges
   for the boundary fragments.
5. Concatenate the bytes in time order; emit as a ByteStream.

The output is a valid streaming-container fragment chain (per 0007's media format spec) — feedable directly to a media decoder. The SDK does not transcode, re-encapsulate, or otherwise process the bytes; it concatenates and streams.

Latency budget: First-byte latency ≈ 1 round trip to the first Fragment (~50 ms on commodity HTTP/2). The application can begin decoding the first Fragment while subsequent Fragments are still in flight.
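Steps 2 and 3 of the hot path amount to selecting overlapping Fragments in time order and flagging which ones need a byte sub-range. In this sketch, Fragment and FetchPlan are hypothetical types and the time range is treated as half-open; mapping a time offset to an actual byte offset depends on the container format in 0007 and is deliberately left out.

```python
from typing import List, NamedTuple, Sequence


class Fragment(NamedTuple):
    address: str
    t_start: int
    t_end: int          # exclusive


class FetchPlan(NamedTuple):
    address: str
    trim_start: bool    # range starts inside this Fragment
    trim_end: bool      # range ends inside this Fragment


def plan_stream(fragments: Sequence[Fragment],
                t_start: int, t_end: int) -> List[FetchPlan]:
    """Select Fragments overlapping [t_start, t_end), in time order."""
    hits = sorted(
        (f for f in fragments if f.t_start < t_end and t_start < f.t_end),
        key=lambda f: f.t_start,
    )
    return [
        FetchPlan(f.address,
                  trim_start=f.t_start < t_start,
                  trim_end=f.t_end > t_end)
        for f in hits
    ]
```

Middle Fragments come out with both flags false, meaning a full-Fragment GET; only the boundary Fragments need a Range header.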

4.4.1 Stream prefetch (look-ahead window)

Object-store tail latency is real. A request for a single Fragment occasionally takes 200–500 ms instead of the typical 50 ms — backend rebalancing, cold-cache misses, transient network hiccups. Without prefetch, every such tail event causes a visible playback stall. With prefetch, tail latencies are masked by the look-ahead buffer.

The SDK SHOULD prefetch ahead of the consumer's current position:

  • Default lookahead: 2 Fragments beyond the current consumer position. Aligns with HLS/DASH player conventions; sufficient to mask typical tail latencies (a 2-second Fragment plus 500 ms tail equals ~2.5 s, masked by a 4 s buffer at 2 Fragments × 2 s). The default is 2 (not 3) to minimize wasted prefetch on slow consumers.
  • Maximum lookahead: SDK-configurable, default cap 10 Fragments. Larger values waste bandwidth and per-request fees (S3 charges ~$0.0004 per 1000 GETs; a 1000 queries/sec workload with lookahead = 10, i.e. roughly 10,000 GETs/sec, costs ~$10K/month in request fees alone). Implementations SHOULD NOT default higher than 10.
  • Adaptive sizing (RECOMMENDED for production): SDKs SHOULD dynamically adjust the lookahead based on the observed consumption rate:
    • If the consumer is consuming faster than fetches arrive (cache miss), grow lookahead (up to the cap).
    • If the consumer is consuming much slower than fetches arrive (cache fill, prefetched Fragments aging out), shrink lookahead toward 1.
    • Reset to default = 2 on stream restart or seek. Without adaptive sizing, slow consumers + aggressive default lookahead produce bandwidth and cost waste.
  • Cancellation discipline: when the consumer abandons the stream (closes the iterator, errors, or seeks to a different time range), the SDK MUST cancel in-flight prefetch GETs via HTTP/2 stream RST (the RST_STREAM frame). Leaving in-flight prefetches running wastes bandwidth and (on per-request-priced backends) incurs real charges.
  • Composition with Stream byte-range fetches: prefetch is a wrapper around the existing Stream HTTP sequence. The SDK issues N+lookahead concurrent ranged GETs against successive Fragments (HTTP/2 multiplexed); the consumer-side iterator yields bytes in order as they arrive.

Cost-aware default: the spec's "default 2, cap 10, adaptive" stance is calibrated to be safe-by-default on per-request-priced backends. Implementations targeting unmetered backends (self-hosted MinIO, fully-paid CDNs) MAY raise the default lookahead, but MUST honor the cancellation discipline regardless.
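The request-fee arithmetic behind the cost-aware default can be checked with a few lines. This is a back-of-envelope sketch: monthly_get_cost is a hypothetical helper, a 30-day month is assumed, and each query is assumed to issue exactly as many GETs as the lookahead value.

```python
def monthly_get_cost(queries_per_sec: float, gets_per_query: int,
                     usd_per_1000_gets: float = 0.0004,
                     days: int = 30) -> float:
    """Request fees in USD per month for a steady GET workload."""
    gets_per_month = queries_per_sec * gets_per_query * 86_400 * days
    return gets_per_month / 1000 * usd_per_1000_gets


# 1000 queries/sec at lookahead = 10 -> about 10,368 USD per month.
cap_cost = monthly_get_cost(1000, 10)
# The same workload at the default lookahead = 2 -> about 2,074 USD per month.
default_cost = monthly_get_cost(1000, 2)
```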

This is a performance hint, not a protocol-level guarantee. Conformant SDKs MUST function correctly with lookahead = 0 (synchronous fetch); but production-grade implementations SHOULD prefetch to mask backend tail latency.
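A minimal sketch of the adaptive sizing rules, assuming a hypothetical LookaheadController that the Stream iterator notifies. The step size of one Fragment per event is an illustrative choice; the spec only fixes the default (2), the cap (10), the floor (1), and the reset on seek.

```python
class LookaheadController:
    """Tracks the prefetch lookahead, in Fragments, for one stream."""

    def __init__(self, default: int = 2, cap: int = 10) -> None:
        if not 1 <= default <= cap:
            raise ValueError("need 1 <= default <= cap")
        self.default = default
        self.cap = cap
        self.lookahead = default

    def on_stall(self) -> None:
        # Consumer had to wait for a fetch: grow, but never past the cap.
        self.lookahead = min(self.cap, self.lookahead + 1)

    def on_unused_prefetch(self) -> None:
        # Prefetched Fragments are aging unread: shrink toward 1.
        self.lookahead = max(1, self.lookahead - 1)

    def on_seek(self) -> None:
        # Stream restart or seek: return to the default.
        self.lookahead = self.default
```

Cancelling in-flight GETs on seek is a separate obligation of the HTTP layer and is not modelled here.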

4.5 Get(session, item_address) → Bytes

Fetch a single Item by its full DreamDB address (per 0002 §6.5).

HTTP sequence:

GET <backend>/<object-address>                          -- 1 GET
   Range: bytes=<start>-<end>                          -- if intra-object locator present

Returns the bytes directly — opaque to DreamDB, decodable by whatever interprets the modality (the application).

This is the simplest verb: one address in, one byte sequence out, one HTTP request. Used when the application has already located an Item via Query and wants its full payload.
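When an intra-object locator is present, the only subtlety is that HTTP byte ranges are inclusive at both ends. The sketch below assumes the locator is an (offset, length) pair, which is an illustrative shape rather than the address grammar of 0002 §6.5.

```python
from typing import Dict, Optional


def range_header(offset: Optional[int] = None,
                 length: Optional[int] = None) -> Dict[str, str]:
    """Build the Range header for Get; empty dict means fetch the whole Object."""
    if offset is None:
        return {}
    if offset < 0 or length is None or length <= 0:
        raise ValueError("locator needs offset >= 0 and length > 0")
    # HTTP ranges are inclusive, so the last byte is offset + length - 1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}
```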

5. Write Verbs

5.1 Append(session, track_selector, items) → AppendResult

Add items to an existing Track or create a new Track. items is a sequence of (time_anchor, payload) pairs (or a single Constant for Constant Tracks).

Zero-Item handling: per 0001 §4.5, Append with zero Items SHOULD be a no-op — the SDK returns success without producing a new Track Object or Manifest. The application calling Append with empty input is expected to skip the subsequent Publish call.

Constant Track Append: per 0001 §4.5 the Track must have exactly one Constant. Append to a Constant Track MUST be called with exactly one Item; zero or two+ items is a programming error and the SDK MUST reject the call.

HTTP sequence (Continuous Signal — Spatial Bucket case):

1. For each item:
   a. Compute spatial_key(s) — one per SpatialIndex table (multi-table per 0004 §6.2).
   b. Determine target bucket Object (cached object_index has bucket → spatial_key map;
      if bucket would exceed size threshold, prepare a new bucket Object).
   c. Append item to bucket-in-construction (in-memory).

2. Once buckets are full or the session flushes:
   a. PUT each new bucket Object to backend. Backend address is content-derived.
      For each: PUT <backend>/<spatial-bucket-path>  (idempotent; 412 = already exists, OK).
   b. PUT new Index Pages reflecting the appended entries (paged form).
   c. PUT the new Track Object, referencing the new root Index Page.

3. Return AppendResult containing the new Track Object's address. The application calls
   Publish (§5.4) to commit a new Manifest referencing it.

HTTP sequence (Continuous Signal — Fragment / media):

1. Encode items into Fragment(s) per 0007's container format.
2. For each Fragment: PUT <backend>/<fragment-path>.
3. PUT updated fragment-index Index Pages (paged form) or inline index entries.
4. PUT new Track Object.
5. Return AppendResult.

HTTP sequence (Discrete Event — bucketed):

1. Group items by time_bucket = floor(t_start / bucket-duration).
2. For each time-bucket: PUT <backend>/<batch-path>.
3. PUT updated Index Pages or inline entries.
4. PUT new Track Object.
5. Return AppendResult.

Standard write ordering (per 0005 §5.3.1): leaf Objects first (Buckets, Fragments, Batches), then Index Pages, then Track Objects. This ensures Manifest Supremacy: a reader resolving a future Manifest that references the new Track will find all dependencies live.
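Two pieces of the sequences above are easy to get subtly wrong: the time-bucket grouping for Discrete Events and the leaf-first PUT order. The sketch below shows both with hypothetical helper names and a hypothetical three-way kind label; integer floor division matches floor(t_start / bucket-duration), including for negative timestamps.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Item = Tuple[int, Any]   # (t_start, payload)


def group_by_time_bucket(items: Sequence[Item],
                         bucket_duration: int) -> Dict[int, List[Item]]:
    """Group items by time_bucket = floor(t_start / bucket_duration)."""
    groups: Dict[int, List[Item]] = defaultdict(list)
    for t_start, payload in items:
        groups[t_start // bucket_duration].append((t_start, payload))
    return dict(groups)


# Leaf Objects first, then Index Pages, then Track Objects.
WRITE_ORDER = {"leaf": 0, "index_page": 1, "track_object": 2}


def order_puts(objects: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (kind, path) pairs into the standard write order (stable)."""
    return sorted(objects, key=lambda obj: WRITE_ORDER[obj[0]])
```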

5.2 Append — atomicity and partial failure

Append is not atomic at the protocol level. Each PUT in the sequence is independently atomic; the sequence as a whole is not. Concrete implications:

  • If Append fails at step N, steps 1..N-1 have left content-addressed Objects on the backend that no Manifest yet references. These are orphan Objects. They cost storage but corrupt nothing.
  • The application MAY retry: re-running Append with the same items produces identical content hashes (deterministic CBOR, deterministic spatial keys). PUTs for already-present Objects are idempotent (412 → success). Effectively, retries pick up where the failure left off.
  • Operator-level GC (per 0005 §3.6) periodically reclaims orphans whose Manifests were never published.

No multi-PUT atomicity is required from the backend. The protocol's correctness comes from the leaf-first ordering plus content-addressing — not from transactional semantics.
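The retry behaviour can be demonstrated against an in-memory stand-in for the backend. FakeBackend, put_if_absent, and the paths used in the usage note are hypothetical; the only behaviour taken from the text is that 412 on a PUT means "already exists" and counts as success.

```python
from typing import Dict, Optional, Sequence, Tuple


class FakeBackend:
    """In-memory stand-in; one chosen path fails once with a 503."""

    def __init__(self, fail_on: Optional[str] = None) -> None:
        self.objects: Dict[str, bytes] = {}
        self.fail_on = fail_on

    def put_if_absent(self, path: str, data: bytes) -> int:
        if path == self.fail_on:
            self.fail_on = None        # fail only the first attempt
            return 503
        if path in self.objects:
            return 412                 # already present: idempotent no-op
        self.objects[path] = data
        return 201


def run_append(backend: FakeBackend,
               puts: Sequence[Tuple[str, bytes]]) -> bool:
    """Issue PUTs in leaf-first order; stop at the first real failure."""
    for path, data in puts:
        status = backend.put_if_absent(path, data)
        if status not in (200, 201, 412):
            return False               # earlier PUTs remain as orphans
    return True
```

Running run_append twice with the same list shows the pattern: the first call stops at the failing path and leaves one orphan; the second call sees 412 for that orphan and completes the rest.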

5.3 Layer(session, parent_track_selector, derived_track_spec) → LayerResult

Publish a new Track that derives from an existing parent Track on the same Timeline.

HTTP sequence:

1. Compute the derived Track's Items from the parent (application logic; e.g.,
   embedding extraction from a video Track).
2. Equivalent to Append (§5.1) for the new Track's modality, with one addition:
   the new Track Object's CBOR includes role = "layer-of:<parent-track-address>".
3. Return LayerResult with the new Track's address.

The new Track is structurally a regular Track; the layer relationship is a Manifest concern, declared in the Manifest's tracks entry (per 0001 §6 and 0002 §7.2).

5.4 Publish(session, manifest_spec) → PublishResult

Atomically commit a new Manifest. This is the only verb that changes the visible state of the Space.

HTTP sequence:

1. Build the new Manifest CBOR:
   - parent: hash of the previous Manifest (from the session's current state).
   - tracks: union of previous Manifest's tracks + new tracks from prior Append/Layer calls.
   - registry: union of previous Manifest's registry + new modalities/SpatialIndex refs.
   - ts: current wall-clock as Unix-ns (per 0003).
   - writer: opaque writer tag.

2. PUT <backend>/manifests/<new-manifest-hash>            -- 1 PUT (412 on re-publish = already exists, OK).

3. (Ref-Conformant only) Advance the Ref:
   GET <backend>/refs/<ref-name>                          -- get current ETag.
   PUT <backend>/refs/<ref-name>                          -- with If-Match: <etag>.
   If 412: re-fetch ref, re-apply (rebuild Manifest if its parent was superseded).

4. Return PublishResult containing the new Manifest hash.

Concurrency under Ref CAS: two concurrent writers attempting to advance the same Ref will see one succeed; the loser receives 412 and either retries or rebuilds the Manifest with the winner's Manifest as the new parent. This is the optimistic-concurrency story from 0000 §5.2 made concrete.

Hash-addressed Spaces (no Ref): Publish simply PUTs the new Manifest. Discovering it requires out-of-band manifest distribution (the writer hands the hash to readers via some channel — Slack, email, another system). Two concurrent writers in this mode produce two diverging Manifests; reconciliation is an application concern.

5.5 Multi-Object writer transactions in summary

A typical writer transaction:

# Bind to the Ref and fetch the current Manifest.
session = Open("dreamdb://<backend>/refs/main")

# Phase 1: stage new content (content-addressed, idempotent PUTs)
result_video = Append(session, "video.h264", video_items)
result_embed = Layer(session, "video.h264", derived_track_spec=embedding_spec)
result_title = Append(session, "title.text", [title_constant])

# Phase 2: commit (Manifest PUT, then Ref CAS)
publish = Publish(session, manifest_spec={
    "tracks": [result_video.track, result_embed.track, result_title.track],
    "ref":    "main",
})

Phase 1 PUTs leaf Objects, Index Pages, and Track Objects — all content-addressed and idempotent. If the process crashes between Phase 1 and Phase 2, the Phase 1 Objects are orphaned (no Manifest references them), but the Space's visible state is unchanged. A retry of the entire transaction reproduces identical bytes (deterministic encoding) and Phase 1 PUTs become no-ops; only Phase 2 makes progress.

This is the lock-free collaborative pattern from 0000 §5.2 in practice: any number of concurrent writers can stage Phase 1 work in parallel without coordination, and the only contention point is the Ref CAS in Phase 2.

6. Concurrency Model

6.1 Reader-reader concurrency

Trivially safe. Readers share no mutable state; backend GETs are idempotent. Multiple SDK sessions on the same Space can issue parallel Query and Stream calls without coordination.

6.2 Reader-writer concurrency

Safe by Manifest Supremacy. A reader bound to Manifest M_n is unaffected by a writer publishing M_{n+1} — the reader continues to resolve via M_n's object_index. The reader sees a consistent snapshot of the Space at M_n until the application chooses to Resolve a newer Manifest.

A reader that wants to "see new writes" calls Resolve(latest_hash) (or fetches the Ref again). Until that call, the reader is operating on a stable snapshot.

6.3 Writer-writer concurrency

The lock-free pattern. Two writers W_a and W_b:

  • Both stage Phase 1 in parallel. Their leaf Objects, Index Pages, and Track Objects are content-addressed and PUT independently. PUTs are idempotent; even if both writers happen to compute the same content (rare but possible — same modality, same Items at same time anchors), they produce identical bytes and the second PUT is a no-op.
  • Both attempt Phase 2. If both target the same Ref, one wins the CAS, one loses (412). The loser:
    • Re-fetches the Ref → new winner Manifest hash.
    • Resolves the winner's Manifest.
    • Rebuilds its own Manifest with the winner as the new parent, preserving its staged tracks.
    • Re-attempts the CAS.

The resolution preserves both writers' work in a serialized history. The losing writer's Phase 1 Objects are NOT thrown away — they're still content-addressed Objects on the backend, and the rebuilt Manifest references them.

For hash-addressed Spaces (no Ref), there's no central coordination point; concurrent writers produce diverging Manifests and any reconciliation must happen out-of-band.
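The loser's recovery loop for the Ref case can be sketched against an in-memory Ref. RefStore, publish_with_cas, and the build_manifest callback are hypothetical; the version-counter ETag and the attempt limit are illustrative choices, and build_manifest stands in for "rebuild the Manifest on the given parent, PUT it, return its hash".

```python
from typing import Callable, List, Tuple


class RefStore:
    """In-memory Ref with ETag-based compare-and-swap."""

    def __init__(self, manifest_hash: str) -> None:
        self.value = manifest_hash
        self.version = 0

    def get(self) -> Tuple[str, str]:
        return self.value, f'"{self.version}"'

    def put_if_match(self, new_value: str, etag: str) -> int:
        if etag != f'"{self.version}"':
            return 412                   # someone else advanced the Ref
        self.value = new_value
        self.version += 1
        return 200


def publish_with_cas(ref: RefStore, staged_tracks: List[str],
                     build_manifest: Callable[[str, List[str]], str],
                     max_attempts: int = 5) -> str:
    """Re-fetch the Ref, rebuild on the current parent, and retry the CAS."""
    for _ in range(max_attempts):
        parent, etag = ref.get()
        new_hash = build_manifest(parent, staged_tracks)
        if ref.put_if_match(new_hash, etag) == 200:
            return new_hash
    raise RuntimeError("Ref CAS kept losing; giving up")
```

The staged tracks are passed unchanged into every rebuild, which mirrors the point that the losing writer's Phase 1 Objects are kept and only the Manifest is regenerated.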

6.4 Adaptive recall widening (resolves OQ-27)

Per 0004 §6.5 (the combined recall-widening procedure; renumbered from §6.4 in 2026-05 when read-time multi-probe was promoted to first-class Lever 4), the SDK MAY iteratively widen the spatial-key prefix-truncation depth M if the initial result set is too small. v0 makes this implementation-defined with three guidelines:

  • The SDK SHOULD start with an M that achieves the requested recall ρ for a default θ_max = 30° (or the modality's declared default, if any).
  • If the result set has fewer than k (the user's requested top-k) candidates, the SDK MAY decrement M and re-issue prefix queries against the cached object_index. (No additional list-prefix round trip — the index is already cached.)
  • The SDK MUST stop widening at M = 0 (full-track scan). Reaching M = 0 indicates either a query vector with no near neighbors in the Track (legitimate) or a misconfigured modality (not DreamDB's concern).

Adaptive widening composes naturally with multi-table — each table widens independently.
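The widening loop for a single table might look like the following. This is a sketch under assumptions: spatial keys are modelled as bit strings, the cached object_index is reduced to a map from spatial key to Item count, and widen_until_k is a hypothetical name. With multiple tables, the same loop would run once per table.

```python
from typing import Dict, List, Tuple


def widen_until_k(index: Dict[str, int], query_key: str,
                  k: int, start_m: int) -> Tuple[int, List[str]]:
    """Shorten the prefix depth M until at least k candidates match.

    Returns the final M and the matching spatial keys. Stops at M = 0,
    which matches every key (a full-track scan).
    """
    m = min(start_m, len(query_key))
    while True:
        prefix = query_key[:m]
        hits = [key for key in index if key.startswith(prefix)]
        if sum(index[key] for key in hits) >= k or m == 0:
            return m, sorted(hits)
        m -= 1      # widen: drop one more bit of the prefix
```

Every iteration runs against the in-memory index only, so widening adds no round trips until the final bucket GETs are issued.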

6.5 Speculative preloading (resolves OQ-28)

Per 0004 §7.3 and 0005 §3.5, the SDK MAY issue speculative GETs for "most likely" Bucket / Fragment Objects while a list-prefix call (cold-start path) is still in flight. v0 makes this implementation-defined:

  • The SDK MUST NOT issue speculative GETs for paths it cannot prove exist, either from the cached object_index or from a list-prefix page it has already received (unproven paths would generate spurious 404s, wasting backend calls and possibly costing money under per-request pricing).
  • The SDK MAY issue speculative GETs for paths derived from a partial result of a paginated list-prefix (i.e., start fetching the first page's hits while later pages are still loading).
  • Cancellation: if a speculatively-fetched Object turns out to be unneeded, the SDK SHOULD cancel the in-flight HTTP/2 stream rather than waste bandwidth.

7. Failure Semantics

7.1 Per-verb failure modes

| Verb    | Failure modes                                             | Recovery                                                                            |
|---------|-----------------------------------------------------------|-------------------------------------------------------------------------------------|
| Open    | Backend unreachable, manifest 404, ref 404, hash mismatch | Retry; if persistent, surface to application                                        |
| Resolve | Manifest 404, hash mismatch, CBOR malformed               | Treat hash mismatch as critical (corrupt data); 404 likely a typo or removed Manifest |
| Query   | Track Object missing, bucket 404, decode error            | See §7.4: surface as ObjectNotFound; do NOT silently treat as zero results          |
| Stream  | Fragment 404 mid-stream                                   | Surface as ObjectNotFound; do not silently skip Fragments                           |
| Get     | 404                                                       | Surface as ObjectNotFound                                                           |
| Append  | PUT failure mid-sequence                                  | Retry from the failed step; idempotent PUTs make retry safe                         |
| Layer   | Same as Append                                            | Same                                                                                |
| Publish | Manifest PUT 5xx                                          | Retry. Don't advance Ref until Manifest PUT succeeds.                               |
| Publish | Ref CAS 412                                               | Per §6.3: rebuild Manifest with new parent, retry CAS                               |

7.2 Crash recovery

A crash mid-Append leaves orphan Objects but no inconsistency. A crash mid-Publish (Manifest PUT succeeded but Ref CAS hadn't completed) leaves the new Manifest content-addressed on the backend; the next Open call won't find it via Ref but can be told its hash directly. Application-level recovery (write-ahead logging the staged Manifest hash before attempting Ref CAS) is implementation-defined.

7.4 ObjectNotFound — error class for missing Objects

When an SDK fetches an Object referenced by a Manifest and receives 404 Not Found, the SDK MUST surface this as a distinct error class — ObjectNotFound — to the application. It MUST NOT silently treat the missing data as zero results, hide the error, or retry indefinitely.

The error MUST carry:

  • The full DreamDB address of the missing Object (the address that returned 404).
  • The Manifest hash from which this Object was reachable (the path that led the SDK to this Object).
  • The Object kind (Bucket / Fragment / Time-batch / Track Object / Index Page / etc.) inferred from the address path.
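A minimal sketch of the error class with the three required fields. The class layout, the check_fetch helper, and the idea of passing the kind in as a string are illustrative; inferring the kind from the address path depends on the path layout in 0005 and is not attempted here.

```python
class ObjectNotFound(Exception):
    """A Manifest-referenced Object returned 404 from the backend."""

    def __init__(self, address: str, kind: str, manifest_hash: str) -> None:
        self.address = address                # full address that returned 404
        self.kind = kind                      # e.g. "Bucket", "Fragment"
        self.manifest_hash = manifest_hash    # Manifest that led to the Object
        super().__init__(
            f"{kind} not found at {address} "
            f"(reachable from Manifest {manifest_hash})"
        )


def check_fetch(status: int, address: str, kind: str,
                manifest_hash: str) -> None:
    """Raise ObjectNotFound on 404 instead of returning empty results."""
    if status == 404:
        raise ObjectNotFound(address, kind, manifest_hash)
```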

7.4.1 When ObjectNotFound legitimately occurs

The ObjectNotFound condition is abnormal but bounded — it indicates that some operator-level event has caused an Object to disappear from the backend while a Manifest still references it. Legitimate causes:

  • GC race (per §7.3.2.1): a long-running write transaction's Phase-1 Object was GC'd before its Manifest was published.
  • Operator-deleted Ref: a Ref was explicitly deleted, GC ran, and an SDK with a cached Manifest hash from that retired Ref tries to query.
  • Cross-backend federation gap: a Manifest was federated from one backend to another, but the transitive Object closure wasn't completely copied yet.
  • Operator error: an operator accidentally deleted Objects without GC's reachability check.

In all these cases, the data is genuinely missing — there is no recovery the SDK alone can perform.

7.4.2 Application-level handling

Applications encountering ObjectNotFound have three reasonable strategies:

  1. Propagate: surface the error to the user / caller. The query failed; show a meaningful error message naming the missing data.
  2. Fall back: the application has a different Manifest (an older snapshot, a different Ref) that still has all its Objects intact; retry against that.
  3. Recover (operator action): identify the missing Object's logical content from external state (an upstream pipeline's output, a backup), re-PUT it. Same content → same hash → restored. The Manifest becomes resolvable again.

DreamDB's role ends at correctly surfacing the ObjectNotFound error with enough context. Recovery strategy is application-specific.
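
Strategies 1 and 2 combine naturally into one application-side helper. The sketch below assumes a caller-supplied `run_query` callable that takes a Manifest hash and raises `ObjectNotFound` when data is missing; both names, and the minimal error class, are stand-ins for illustration.

```python
class ObjectNotFound(Exception):
    """Minimal stand-in for the SDK's error class."""


def query_with_fallback(run_query, manifest_hashes):
    """Try each Manifest in order; return (manifest_hash, result) for the first that resolves.

    Re-raises the last ObjectNotFound if every candidate is missing data.
    """
    last_error = None
    for manifest_hash in manifest_hashes:
        try:
            return manifest_hash, run_query(manifest_hash)
        except ObjectNotFound as err:
            last_error = err  # fall back to the next (older) snapshot
    if last_error is None:
        raise ValueError("no Manifest candidates supplied")
    raise last_error  # nothing resolved: propagate to the caller
```

Returning the Manifest hash alongside the result lets the application tell the user that an older snapshot answered the query.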

7.4.3 Distinguishing ObjectNotFound from a typo

A 404 on Get(<address>) for an address the application supplied directly (without going through a Manifest) is distinguishable from ObjectNotFound:

  • The address may simply have never existed (typo, fabricated hash).
  • This is a programming error, not a data-integrity event.

SDKs MAY surface this as a distinct error (AddressNotFound) or as ObjectNotFound with a "no Manifest context" marker. Implementation-defined.

7.3 Garbage and orphan collection

Orphan Objects accumulate over time. Sources include:

  • Failed Append transactions (Phase 1 PUTs succeeded but Publish never happened).
  • Aborted Publishes (Manifest PUT succeeded but Ref CAS lost; the staging writer either rebuilt and won, or abandoned, in either case orphaning the original Manifest).
  • Reverted experiments (writer staged Phase 1 work, decided not to commit).
  • Branches that were never merged or referenced again (0008 will detail).

At even modest write rates, orphan accumulation matters within months. DreamDB provides no protocol-level GC verb (it would require centralized coordination, conflicting with the lock-free design); operators run GC out-of-band. The recommended algorithm:

7.3.1 Two-step GC algorithm

Step 1: compute the reachable set.

reachable = ∅
for each Ref live on the backend:
   manifest_hash = GET refs/<ref-name>
   walk the Manifest DAG starting at manifest_hash:
      for each visited Manifest (down to a configurable depth, default infinite):
         add its hash to reachable
         add its parent hashes to the walk queue
         for each Track entry in tracks (paged or inline):
            add Track Object hash to reachable
            walk Track Object's object_index (paged or inline):
               add each Index Page hash to reachable
               add each Item Object hash (Bucket / Fragment / Batch / Constant) to reachable
               -- For reference-mode Spatial Bucket Objects (per 0007 §6.3.1),
               -- transitively walk into the Bucket and mark every VS Object
               -- it references via vec_obj_hash:
               for each Bucket Object in object_index:
                  if Bucket is reference-mode:
                     decode Bucket's reference table
                     for each reference's vec_obj_hash:
                        add VS Object hash to reachable
         add SpatialIndex Object hashes from registry to reachable
      add Genesis Object hashes for each Timeline to reachable

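The walk is fan-out-bounded: each Manifest references a fixed structure; the total work is O(reachable-set-size). Live workloads typically have 10⁵–10⁹ reachable Objects. Held as a set of 33-byte multihashes, that is about 3.3 MB of raw hash data at 10⁵ Objects and about 33 GB at 10⁹, before set overhead; the low and middle of the range fit comfortably in memory, while the top end needs a large-memory host or an on-disk set.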
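
The mark phase can be sketched over plain dictionaries standing in for decoded Objects. The dictionary shapes, the `extras` field (used here for SpatialIndex and Genesis hashes), and the function name `compute_reachable` are assumptions made for illustration; a real tool would fetch and decode each Object from the backend instead.

```python
from collections import deque


def compute_reachable(ref_tips, manifests, tracks, reference_buckets):
    """Mark phase over in-memory stand-ins for decoded Objects.

    ref_tips:          iterable of Manifest hashes, one per live Ref
    manifests:         hash -> {"parents": [...], "tracks": [...], "extras": [...]}
    tracks:            hash -> {"index_pages": [...], "items": [...]}
    reference_buckets: hash -> [vec_obj_hash, ...] for reference-mode Buckets only
    """
    reachable = set()
    queue = deque(ref_tips)
    while queue:
        m_hash = queue.popleft()
        if m_hash in reachable:
            continue  # shared history: each Manifest is visited once
        reachable.add(m_hash)
        manifest = manifests[m_hash]
        queue.extend(manifest["parents"])
        reachable.update(manifest.get("extras", ()))  # SpatialIndex / Genesis hashes
        for t_hash in manifest["tracks"]:
            if t_hash in reachable:
                continue  # Track Object already walked via another Manifest
            reachable.add(t_hash)
            track = tracks[t_hash]
            reachable.update(track["index_pages"])
            for item in track["items"]:
                reachable.add(item)
                # Reference-mode Buckets pull in the VS Objects they point at.
                reachable.update(reference_buckets.get(item, ()))
    return reachable
```

Skipping Track Objects that are already marked matters in practice, since consecutive Manifests usually share most of their Tracks.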

Step 2: LIST + diff + age-threshold + DELETE.

LIST the entire backend (admin scan; exhaust pagination per 0005 §3.5.2):
   for each Object key:
      extract the content-hash from the DreamDB path
      if hash ∈ reachable:
         keep
      elif Last-Modified > now - safety_threshold:
         keep      -- in-flight transaction; might be made reachable later
      else:
         DELETE the Object

The Object's Last-Modified (returned by HEAD/LIST on every modern backend per 0005 §10) is the GC's age signal. No custom Object metadata is needed — relying on backend-native timestamps avoids a Connector translation layer for x-amz-meta-* / metadata.* / x-ms-meta-* flavors that don't add information beyond what's already present.
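
The sweep decision itself is a small pure function. The sketch below assumes the LIST results have already been parsed into `(key, content_hash, last_modified)` tuples and that keys without a content hash (such as Refs) were filtered out beforehand; the tuple shape and the name `sweep` are illustrative.

```python
from datetime import datetime, timedelta, timezone


def sweep(listing, reachable, now, safety_threshold=timedelta(hours=24)):
    """Return the keys that step 2 would DELETE.

    listing: iterable of (key, content_hash, last_modified) tuples.
    """
    cutoff = now - safety_threshold
    doomed = []
    for key, content_hash, last_modified in listing:
        if content_hash in reachable:
            continue  # referenced by some live Manifest
        if last_modified > cutoff:
            continue  # young orphan: possibly an in-flight transaction
        doomed.append(key)
    return doomed


# Usage: one reachable Object, one old orphan, one young orphan.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
listing = [
    ("fragments/a", "a", now - timedelta(days=3)),
    ("fragments/b", "b", now - timedelta(days=3)),
    ("fragments/c", "c", now - timedelta(hours=1)),
]
to_delete = sweep(listing, reachable={"a"}, now=now)  # only "fragments/b"
```

Returning the list instead of deleting inline makes a dry run trivial, which is a sensible default for a destructive admin tool.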

7.3.2 Safety threshold

Operators SHOULD use a default safety_threshold of 24 hours. Rationale:

  • Phase-1 Append work that hasn't been Published within 24 h is almost certainly abandoned (production Publishes rarely take that long; §7.3.2.1 covers the exceptions).
  • Long-running multi-stage ingest pipelines (e.g., ingest video → extract embeddings → publish) typically complete within minutes; 24 h is generous headroom.
  • High-write-rate or high-latency systems with long Publish chains MAY use 48 h or 72 h. Configurable per operator.
  • Never less than 1 hour. Aggressive thresholds risk deleting Objects from in-progress transactions and corrupting the resulting Manifest.

7.3.2.1 GC vs. long-running transactions: the race window

The naive race scenario: a writer at PUT time T_write_start puts a leaf Object. GC runs later at T_GC_start. If T_GC_start − T_write_start > safety_threshold AND the writer hasn't yet Published a Manifest referencing that leaf, GC could DELETE the leaf — the writer then publishes a Manifest referencing missing data.

The 24h default safety threshold covers most workloads, but a multi-day ingest pipeline (e.g., transcribing a long-running video stream + extracting embeddings + publishing) can plausibly exceed 24 h between Phase-1 PUT and Phase-2 Publish.

Two patterns mitigate the race; writers SHOULD adopt one:

Pattern A (bounded transactions, preferred): writers SHOULD complete each DreamDB transaction (Phase 1 → Phase 2) within 1 hour of starting Phase 1 PUTs. Long-running ingest pipelines structure their work as a sequence of bounded transactions; for example, they publish each hour's worth of fragments and embeddings as one transaction rather than accumulating days of work before a single Publish.

Pattern B (touch-to-extend): when a transaction must span longer than the safety threshold, the writer SHOULD periodically re-PUT its in-flight leaf Objects with the same content. The backend updates Last-Modified on the repeated PUT (idempotent at the byte level: same bytes, same hash), buying another safety_threshold window. Writers SHOULD touch every staged Object every safety_threshold / 2 to keep the race risk bounded.
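
For Pattern B, the writer needs to know which staged Objects are due for a re-PUT. A minimal sketch, assuming the writer tracks the time of its most recent PUT per content hash; the `staged` mapping and the name `due_for_touch` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def due_for_touch(staged, now, safety_threshold=timedelta(hours=24)):
    """Return hashes of staged Objects whose last PUT is at least safety_threshold / 2 old.

    staged: dict mapping content hash -> datetime of the most recent PUT.
    """
    interval = safety_threshold / 2
    return sorted(h for h, last_put in staged.items() if now - last_put >= interval)


# Usage: with the 24 h default, anything last PUT 12 h or more ago is due.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
staged = {
    "x": now - timedelta(hours=13),
    "y": now - timedelta(hours=2),
    "z": now - timedelta(hours=12),
}
due = due_for_touch(staged, now)
```

After re-PUTting each returned Object, the writer would reset its entry in `staged` to the current time.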

Operator tuning: environments with structurally long-running transactions SHOULD raise safety_threshold proportionally (e.g., 72h for video-archive ingest, 168h for week-long batch jobs). The threshold trades GC efficiency for transaction safety; pick the smallest value that comfortably exceeds the longest legitimate transaction duration.

Recovery if the race fires: if a Manifest is published referencing a GC'd Object, readers will encounter ObjectNotFound (per §7.4) when resolving that Manifest. The Manifest is dead in the sense that no reader can complete reads from it. Recovery is operator-level: identify the missing Object's logical content (from application context), re-PUT it (same bytes → same hash → restored), and the Manifest becomes resolvable again. This is not graceful but it's bounded: the lost-data window is between GC and Object re-PUT.

7.3.3 GC operational notes

  • GC is safe to run concurrently with writers and readers, by construction. Reachable Objects are protected by the reachability check; in-flight Objects are protected by the age threshold; the immutability of all content-addressed Objects means there is no "object being modified" race.
  • GC is idempotent: running it twice in a row does no harm, and the second run finds no additional candidates unless more orphans have aged past the threshold in between.
  • GC SHOULD be run periodically (e.g., daily). Ad-hoc GC after large failed transactions is also valid.
  • For multi-Region deployments, GC SHOULD walk Refs from every Region before deleting (a Ref in one Region might reference Objects another Region considers orphans).

This is the same "mark and sweep over an immutable substrate" pattern Git uses for unreachable commits.

7.3.4 Scaling: full-walk GC vs. incremental GC

The full-walk algorithm above is O(historical Manifests + Objects) in cost. A 10-year-old Space with live ingest at 1 Manifest/minute accumulates ~5M Manifests in history; the reachable-set walk traverses every one. For very long-running Spaces this becomes operationally expensive (hours to days per GC run, multi-GB working set).
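
The ~5M figure follows from simple arithmetic, shown here together with the raw size of the Manifest hashes alone, using the 33-byte multihash size quoted in §7.3.1. The variable names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600
manifests = 10 * MINUTES_PER_YEAR         # 10 years at 1 Manifest/minute = 5,256,000
hash_bytes = manifests * 33               # 173,448,000 bytes of Manifest hashes alone
```

The Manifest hashes themselves are modest (about 173 MB); the working set grows much larger once the leaf Objects reachable from each Manifest are included.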

Incremental GC (deferred to v0.1 as a fully-specified pattern) reduces this to amortized work proportional to recent activity:

  • The operator periodically writes a GC checkpoint Object to the backend recording (a) the set of Manifest hashes reachable as of the checkpoint time, (b) the timestamp of the checkpoint.
  • Subsequent GC runs walk only Manifests newer than the latest checkpoint plus a small fixed lookback window, take the union with the checkpoint's reachable set, then proceed with the LIST + diff phase as usual.
  • Checkpoint Objects are themselves content-addressed and protected from deletion by being reachable from the GC tooling's own pseudo-Ref.

For v0, operators of Spaces with >100K historical Manifests SHOULD plan for either full-walk GC at low frequency (weekly / monthly) or a custom incremental approach. The full-walk algorithm is correct and conformant; it's just slow at extreme history depths. v0.1 will pin the incremental pattern as a normative SHOULD with checkpoint Object format.
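
Since the incremental pattern is not yet specified, the following is only one possible reading of the bullets above. It assumes each Manifest carries a numeric creation timestamp and that a checkpoint is a dict with `manifests` (a set of hashes) and `timestamp`; those shapes and the name `incremental_manifest_walk` are hypothetical. It covers Manifest hashes only, not the Objects they reference.

```python
def incremental_manifest_walk(ref_tips, manifests, checkpoint, lookback):
    """Walk back from live Refs, stopping at Manifests the checkpoint already covers.

    manifests:  hash -> {"parents": [...], "created": seconds}
    checkpoint: {"manifests": set of hashes, "timestamp": seconds}
    lookback:   seconds of history re-walked before the checkpoint time
    Returns (reachable Manifest hashes, hashes actually visited in this run).
    """
    cutoff = checkpoint["timestamp"] - lookback
    reachable = set(checkpoint["manifests"])
    visited = set()
    stack = list(ref_tips)
    while stack:
        h = stack.pop()
        if h in visited:
            continue
        visited.add(h)
        reachable.add(h)
        if h in checkpoint["manifests"] and manifests[h]["created"] < cutoff:
            continue  # covered by the checkpoint and older than the lookback window
        stack.extend(manifests[h]["parents"])
    return reachable, visited
```

This sketch never shrinks the checkpointed set, so Manifests that became unreachable after a Ref deletion would still need an occasional full walk to be collected.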

8. Worked End-to-End Example

A 1-hour video is ingested, embedded for semantic search, and then queried.

8.1 Ingest

session = Open("https://my-bucket.s3.amazonaws.com/", ref="main")

video_items = [Fragment_0, Fragment_1, ..., Fragment_1799]   # 2-second fragments
result_v = Append(session, "video.h264", video_items)
# HTTP: 1800 PUTs for Fragments + ~10 PUTs for Index Pages + 1 PUT for Track Object.
# All HTTP/2 multiplexed; ~30 s on a 1 Gbps link.

embed_items = [Vector(t_i, v_i) for each video keyframe]    # ~3,600 vectors
result_e = Layer(session, "video.h264", derived_track_spec=embedding_768_spec)
# HTTP: ~50 PUTs for Spatial Bucket Objects + ~5 PUTs for Index Pages + 1 PUT for Track Object.

title = Constant("My Video Recording")
result_t = Append(session, "title.text", [title])
# HTTP: 1 PUT for Constant Object + 1 PUT for Track Object.

publish = Publish(session, {
   tracks: [result_v.track, result_e.track, result_t.track],
   ref: "main",
})
# HTTP: 1 PUT for new Manifest + GET+PUT (CAS) on refs/main.
# Total ~5 ms for the commit step.

8.2 Query (cold-start, second day)

session = Open("https://my-bucket.s3.amazonaws.com/", ref="main")
# HTTP: 1 GET refs/main + 1 GET manifests/<hash>. ~50 ms.

result = Query(session, "embedding.f32.dim=768.bucketed.spatial-bits=18", {
   feature: query_vector,
   k: 10,
   recall: 0.9,
})
# HTTP cold-start:
#   1. GET Track Object (1 GET) + GET root Index Page (1 GET) + GET relevant subtree
#      Index Pages (~3 GETs).  Total ~5 GETs, ~20 ms.
#   2. GET 16 matching Spatial Bucket Objects (HTTP/2 multiplexed). ~80 ms.
#   3. Local exact-KNN over ~60K candidates. ~20 ms.
# Total: ~120 ms p50.

# Subsequent queries on the same Track Object hit the cache and skip step 1: ~100 ms (steps 2 and 3).

8.3 Stream (the moment of interest)

result_item = result.items[0]                       # top-1 from the query
time_anchor = result_item.time_anchor               # 152.481 s

stream = Stream(session, "video.h264", [time_anchor - 5_000_000_000,
                                         time_anchor + 5_000_000_000])
# 10-second window centered on the result. HTTP:
#   - Track Object cached → fragment-index resolves overlapping Fragments.
#   - 6 Fragment ranged-GETs (the 10 s window is not aligned to 2-second fragment boundaries, so it overlaps six Fragments).
#   - First-byte latency ~50 ms; full stream readable as it arrives.

play(stream)                                          # application's media decoder

End-to-end query → playback < 200 ms. Sub-100 ms achievable with smaller modality buckets and PQ compression in 0007.
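
The number of Fragments a Stream window touches depends on how the window lines up with fragment boundaries. The sketch below assumes fixed 2-second Fragments aligned to t = 0 and a half-open window; a real SDK resolves overlap through the Track Object's fragment index instead, and the function name is hypothetical.

```python
FRAGMENT_NS = 2_000_000_000  # 2-second fragments, in nanoseconds


def overlapping_fragments(start_ns, end_ns, fragment_ns=FRAGMENT_NS):
    """Indices of fixed-duration Fragments overlapping the half-open window [start_ns, end_ns)."""
    first = start_ns // fragment_ns
    last = (end_ns - 1) // fragment_ns
    return range(first, last + 1)


anchor = 152_481_000_000  # 152.481 s
window = overlapping_fragments(anchor - 5_000_000_000, anchor + 5_000_000_000)
indices = list(window)    # [73, 74, 75, 76, 77, 78]
```

A 10-second window that starts exactly on a fragment boundary touches five Fragments; any other start touches six, because both ends fall partway into a Fragment.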

9. Out of Scope for this Document

  • Cross-Timeline join verbs. DreamDB v0 has no such join (per 0001 §11). Future spec MAY add one.
  • Subscribe / Watch verbs. v0 is pull-only. The SDK polls Refs to discover new Manifests; push-style notifications are out of scope.
  • Branching verbs (Branch, Merge). Manifest history walking and merge semantics are 0008.
  • Garbage collection verb. GC is operator-level (per §7.3), not a protocol verb.
  • Cross-Space federation. v0 SDK opens one Space per session. Multi-Space joins are application concerns.

10. Open Questions Surfaced by This Document

  • OQ-33 (→ 0009 §7): Conformance test vectors for verb behavior. Resolved: full battery in 0009 §7 covering standard Append → Publish round-trip; Append retry after mid-sequence failure; concurrent Publish reconciliation; cold-vs-hot-path Query latency assertions; GC algorithm; Stream prefetch under simulated tail latency; Ref-freshness polling.
  • OQ-34 (→ v0.1 spec): A Subscribe or Watch verb that allows the SDK to be notified of Ref advancement (vs. polling). Out of scope for v0; future spec should address.
  • OQ-35 (→ 0008 §4.2): Explicit parent for Publish. Resolved: Publish(session, manifest_spec={parents: [<hash>], ...}) accepts an explicit parents array; absent → implicit parent = Session's loaded Manifest tip.

Next: 0007-streaming-encapsulation.md — defines the byte format inside Fragments, Spatial Buckets, and Time-bucketed batches; pins fragment durations, bucket-splitting thresholds, byte-range vs. inline storage decisions (OQ-23, OQ-24); and resolves OQ-4, OQ-7, OQ-13, OQ-14, OQ-15, OQ-20, OQ-21.