DreamDBv0.2.0bec026

Spec 0012 — Federation & Cross-Cluster Queries

Status: Draft (Phase 3 design). Depends on: spec/0001, spec/0002, spec/0005, spec/0008. Motivation: spec/0008 §8.3 acknowledges that a Manifest's content closure can be served from any backend that holds it but defers a protocol-level replication primitive (OQ-40). spec/0008 §10 also explicitly punts "Cross-Space federation primitives" to v0.1+. At billion-scale, single-backend DreamDB already works (ref-implementation hits ~1B vectors per Track). Ten-billion-scale demands cross-backend operation by definition: one S3 region's tail latencies, capacity ceilings, and blast radius force operators to shard across regions/clouds/clusters. This spec defines the on-wire and operational primitives that make cross-backend DreamDB coherent rather than a pile of mutually-opaque single-cluster deployments.


1. Purpose

Federation is the discipline of treating N backends as one logical DreamDB without:

  • Centralizing the data plane (would break the "lock-free data plane" principle from spec/0000 §5.2).
  • Inventing a new naming layer (addresses remain <timeline>/<modality>/...; the timeline IS the unit of federation).
  • Requiring transactional coordination across backends (impossible at WAN scale; spec/0005 already states eventual consistency for refs).

By the end of this document the following are concrete:

  • The federation models: mirror (whole-Timeline replication), shard (per-Timeline partitioning), hybrid (mirror + shard).
  • The Federation Manifest: a new ObjectKind that points at component Manifests across backends. Manifest-of-Manifests for cross-backend queries.
  • The federate verb: protocol-level primitive to push or pull a Manifest's transitive closure between backends. Resolves OQ-40.
  • The cross-cluster query path: scatter-gather over Federation-Manifest children with bounded fan-out and partial-result semantics.
  • The trust & capability model: read-only by default, capability tokens for write-through, per-backend hash verification mandatory.
  • Storage and bandwidth budgets at 10B scale.

What stays defined elsewhere:

  • The address grammar — spec/0002.
  • Per-Timeline content addressing & immutability — spec/0001, spec/0002.
  • The HTTP backend contract — spec/0005.
  • Manifest DAG and refs — spec/0008.

What this document does NOT define:

  • Cross-backend transactions. Federation is eventually consistent across backends; spec/0008's CAS guarantee remains a per-backend property.
  • Quorum-based consistency models (Paxos, Raft). Operators wanting strong cross-backend consistency layer their own consensus on top of single-backend DreamDB.
  • Service discovery / registry. How a client learns "backend B exists" is operator-coordinated — DNS, a config file, a service mesh.
  • Per-modality cross-backend indexing (e.g. a global vector index spanning N backends). Deferred to spec/0013 (graph-based ANN with cross-shard routing).

2. Federation models

Three operational patterns. A deployment MAY combine them per-Timeline.

2.1 Mirror

The same Timeline T is published to backends A and B. Both hold the transitive closure of T's latest Manifest. Read queries route to either backend; write coordination is operator-defined (one-writer per Ref is the simplest; spec/0008 §6 covers multi-writer rebase).

Use case: HA / DR; serving reads from the geographically closest backend.

Storage cost: 2× (or N× for N mirrors). Bandwidth cost: one-time closure transfer per Manifest publish; ongoing delta transfer per write.

2.2 Shard

Different Timelines live in different backends. T_a in backend A, T_b in backend B, T_c in backend C. A cross-backend query spans the Federation Manifest that points at all three.

Use case: capacity scaling (one S3 region too small or too expensive); jurisdictional sharding (EU data in EU bucket); per-tenant isolation.

Storage cost: 1× — each Timeline lives in exactly one backend. Bandwidth cost: zero replication; per-query scatter-gather instead.

2.3 Hybrid

A Timeline is sharded by anchor range or by Track. Backend A holds Tracks 0–999; backend B holds Tracks 1000–1999. The Federation Manifest carries the sharding rule.

Use case: ultra-large Timelines (>10B Items in one logical stream) where mirror is too expensive and shard-by-Timeline loses cross-Timeline query coherence.

Storage cost: 1×. Bandwidth cost: per-query scatter-gather, partial — only the shard(s) covering the query range participate.

3. The Federation Manifest

A Federation Manifest is a new ObjectKind that names component Manifests living in distinct backends. It is content-addressed like any other Manifest; its hash is independent of the backends that resolve it.

Address path:

federation-manifests/<multihash-of-canonical-CBOR-bytes>

(New top-level namespace, parallel to manifests/. Federation Manifests are themselves NOT bound to a single backend; the same Federation Manifest hash MAY be served from any backend that holds the bytes.)

3.1 CBOR encoding

{
  "version":  1,                                ;; format version
  "parents":  [<multihash>, ...],               ;; predecessor Federation Manifests (DAG, per spec/0008)
  "shards":   [                                 ;; ordered list; ordering carries the query priority
    {
      "manifest_hash":  <multihash>,            ;; component Manifest's content hash
      "backends":       [<url>, ...],           ;; HTTP(S) URLs where the closure is resolvable
      "shard_role":     "<role-id>",            ;; "mirror" | "shard" | "hybrid-range"
      "shard_key":      <sub-Object | null>,    ;; non-null for "hybrid-range" or "shard"; see §3.2
      "trust":          <sub-Object>,           ;; capability / verification policy (§5)
    },
    …
  ],
  "default_quorum": <unsigned int>,             ;; min successful shards for a query response; default = len(shards)
  "registry":       { … },                      ;; optional cross-shard schema, mirrors spec/0002 §7.2.2
}

parents, like the Manifest DAG in spec/0008, supports atomic federation-level commits: a new Federation Manifest pointing at a different shard composition is the equivalent of an in-cluster Publish. Federation Refs (§4.3) advance via CAS exactly like per-backend Refs.

3.2 Shard-key sub-Object

For shard_role = "shard" (per-Timeline shard) or "hybrid-range" (within-Timeline anchor-range shard):

;; "shard" — disjoint Timelines per backend
{
  "timeline_id":    <multihash>,                ;; the Genesis hash of the Timeline this shard holds
}

;; "hybrid-range" — anchor-range partition within one Timeline
{
  "timeline_id":    <multihash>,                ;; the Timeline (same across hybrid shards)
  "anchor_range":   [<lo: u64>, <hi: u64>],     ;; half-open: covers anchors lo ≤ a < hi
  "track_modality": "<modality-tag | null>",    ;; null ⇒ all modalities; else this shard holds this modality's Items
}

For "mirror"shard_key is null; all mirrors hold identical content.

3.3 Why a separate ObjectKind (not just a Manifest with extra fields)

Federation Manifests differ from per-backend Manifests in three load-bearing ways:

  1. They reference external backends. Per-backend Manifests resolve every hash within their own backend; Federation Manifests reference backends by URL.
  2. They carry partial-result semantics. default_quorum lets a query succeed when some shards are unavailable; per-backend Manifests have no such notion (a missing Object is always a hard error).
  3. They carry trust metadata. Per-backend Manifests assume the local backend is trusted; Federation Manifests must validate hashes from third-party backends explicitly (§5).

Folding these into per-backend Manifests would force every SDK to handle URL-resolution, quorum, and capability tokens even for single-backend deployments — exactly the complexity spec/0000 §5.2 says to avoid.

3.4 Federation Manifests as Manifests (recursion)

A Federation Manifest MAY reference another Federation Manifest in its shards[i].manifest_hash slot (provided the referenced Object's path indicates federation-manifests/). This permits tree-of-trees composition: a "global" Federation Manifest covers continents, each continent's manifest covers regions, each region's manifest covers backends. Recursion depth is bounded by the SDK at fetch time (default cap: 4; configurable).

4. The federate verb

Resolves OQ-40. A protocol-level primitive that copies a Manifest's transitive closure from a source backend to a destination backend.

4.1 Push mode

The source backend's operator initiates:

POST <dest_backend>/v1/federate/push
Content-Type: application/cbor

{
  "manifest_hash":   <multihash>,
  "source_backend":  <url>,
  "capability":      <bytes>,                   ;; (§5) authorizes write to dest
  "include_refs":    [<ref-name>, ...] | null,  ;; optional: also create dest-side Ref(s)
}

The destination MUST:

  1. Validate the capability token against its local policy.
  2. Walk the source Manifest's transitive closure (Track Objects, Bucket Objects, VS Objects, Index Pages, etc.) by fetching from source_backend.
  3. For every Object fetched, verify its content hash matches its path before storing locally. A path/byte mismatch is a critical error: the destination MUST refuse and abort the push.
  4. PUT each verified Object to its local backend at its canonical path. PUTs are idempotent by content-address; collisions are no-ops.
  5. If include_refs was specified, create/update those refs locally to point at the freshly federated manifest_hash (CAS as usual; failure surfaces).
  6. Respond with a summary: count of Objects transferred, count already-present, total bytes.

The push is at-least-once — if it crashes mid-walk and is retried, the idempotent PUTs make repetition safe.

4.2 Pull mode

The destination backend's operator initiates:

POST <source_backend>/v1/federate/pull-manifest-list
Content-Type: application/cbor

{
  "ref_name":  "refs/main",                     ;; or a specific manifest_hash
  "since":     <multihash | null>,              ;; only deltas since this Manifest; null ⇒ full closure
  "capability": <bytes>,                        ;; (§5) authorizes read from source
}

Response: ordered list of Object hashes the destination needs. Destination then fetches each from source_backend (existing per-Object GET path; spec/0005). Same hash-verify-before-store discipline as §4.1 step 3.

Pull mode is the typical pattern for receive-side-driven federation (mirror sites, archive nodes, low-trust peers). The source backend does not need write access to the destination; only the destination needs read access to the source.

4.3 Federation Refs

A Federation Ref is the cross-backend analog of a per-backend Ref (spec/0008 §4). Path:

federation-refs/<ref-name>

It points at a Federation Manifest hash. Advancement via CAS: the destination provides an expected_etag, the source-of-truth backend (typically a designated coordinator backend, but any one of the federation participants can play this role) performs the CAS as if it were a normal Ref update (spec/0005 §5.3).

Critically, only ONE backend in the federation holds the authoritative Federation Ref. Mirror backends pull from it; they do not race to advance it. This avoids cross-backend CAS, which is intractable at WAN scale.

5. Trust & capability model

Federation crosses administrative boundaries. The protocol assumes mutual suspicion:

5.1 Hash verification is mandatory

For every Object fetched from a non-local backend, the receiver MUST verify the BLAKE3 hash matches the path's <hash> segment before storing or using the bytes. spec/0002's "the path IS the hash" guarantee makes this O(bytes) and unavoidable.

A backend serving a different byte sequence under a known hash is provably misbehaving. The receiver MUST log, abort the current operation, and (operator-defined) blacklist the source.

5.2 Capability tokens

shards[i].trust and federate verb requests carry a capability token. The minimal format:

{
  "issuer":      <url>,                         ;; backend that issued this capability
  "subject":     <bytes>,                       ;; opaque identifier (typically pubkey or audience)
  "scope":       "read" | "write" | "admin",
  "expires_at":  <u64>,                         ;; unix-ns expiry
  "signature":   <bytes>,                       ;; signature over the above fields by issuer's pubkey
}

signature algorithm: Ed25519 mandatory (the only required algorithm). The signing pubkey is published by issuer at a well-known path (<issuer>/.well-known/dreamdb/federation-pubkey); receivers cache it.

Multi-tenant deployments extend this format with tenant_id and scope_path fields (per 0018 §3). The base shape above is the federation-only minimum; the extended shape is REQUIRED when the issuer is a multi-tenant operator. Verifiers MUST accept the extended shape via the map-extensibility rule (0002 §3.1.3).

A backend MUST refuse a federation operation if the capability token's signature doesn't verify OR the token is expired. Read-only operations MAY proceed without a token if the source backend's policy permits anonymous reads (e.g. public datasets).

5.3 Out of scope for v0.X federation auth

  • OAuth2 / OIDC integration. Capability tokens are intentionally minimal; integration with external IdPs is operator-layer concern.
  • Per-Object ACLs. DreamDB's content-addressed model gives Object granularity at the path level; finer-grained access control is a non-goal.
  • Encryption at rest / in transit. spec/0005 mandates HTTPS for the wire; per-Object encryption is application-layer.
  • Revocation. Capability tokens have an expiry; short-lived tokens (minutes-to-hours) are the recommended pattern. Long-lived revocation is operator-layer.

6. Cross-cluster query semantics

A query against a Federation Manifest evaluates each shards[i] and merges results.

6.1 Scatter-gather

For each shard s in fm.shards:
  Spawn (concurrent) query against s.backends[j] for s.manifest_hash with the original query.
  Apply s.shard_key to prune: if the query's anchor range doesn't overlap s.anchor_range, skip s.
Gather responses with a per-shard timeout (operator default: 1.5× single-backend p99).
Merge per shard_role:
  - "mirror"       — first non-error response wins; lateness cancels remaining
  - "shard"        — concatenate
  - "hybrid-range" — concatenate; downstream rank merges by score

If fewer than fm.default_quorum shards return successfully, the query returns a partial result with a partial: true flag and the list of failed shards. The application decides whether to retry or accept partial.

6.2 Vector queries across shards

For vector ANN, scatter-gather with rank merge requires care: each shard returns its local top-K, and the merge step keeps the global top-K by score. This is mathematically correct ONLY if each shard's local top-K contains the global top-K's members from that shard — i.e., only if every shard with non-zero candidates returns at least K results. Sparse shards (where the query has fewer than K candidates) fall out of this guarantee.

Practical mitigation: query each shard for top-K × oversample where oversample ≥ 4. Empirically this closes the residual recall gap to <1% for typical embedding distributions.

6.3 Time-range queries across shards

For time-range queries on hybrid-range shards, the anchor_range field allows the query planner to prune shards that don't intersect the query's time window. This is a major efficiency win at 10B-scale where most queries touch a small time window.

6.4 Latency budget at 10B-scale

Example: 10 shards, each holding 1B vectors, dim=768, IMI+QINCo (per spec/0010), tables=1, hot caches everywhere.

Per-shard latency (per spec/0004 §7.3 plus QINCo from spec/0010 §10): ~30–80 ms p50.

Scatter-gather fan-out is concurrent over HTTP/2. The cross-cluster latency is bounded by the slowest shard, plus inter-shard rank merge (negligible, <5 ms for K_max=10000).

End-to-end p50: ~40–100 ms — essentially the same as single-backend, because the shards process in parallel. The p99 increases proportionally to fan-out ((1 - (1-p99)^N) ≈ N × p99 for small p99), so at N=10 the p99 climbs from ~150 ms (single) to ~500–800 ms. Operators mitigate via per-shard hedged requests, partial-result fallback, or capacity overprovisioning.

7. Closure walk algorithm

Both push and pull modes need to enumerate every Object reachable from a Manifest. The receiver computes the set; the source serves bytes on demand.

7.1 Walk

mark_set = {}
queue    = [manifest_hash]
while queue:
  h = queue.pop()
  if h in mark_set: continue
  mark_set.add(h)
  obj_bytes = GET source_backend / path-for-hash(h)
  verify hash(obj_bytes) == h          ;; CRITICAL — §5.1
  references = scan_outbound_hashes(obj_bytes)
  queue.extend(references)

scan_outbound_hashes is the union of:

  • Manifest → Track Objects, registry SpatialIndex / ScalarIndex / VectorCompressor hashes, parent Manifest hashes.
  • Track Object → Bucket Objects, Index Pages, VS Objects.
  • Index Page → Bucket / VS Objects, child Index Pages.
  • Bucket / VS Object → no further outbound (leaf).
  • Federation Manifest → component Manifests; recurse with depth cap (§3.4).

7.2 Delta walks

since: <prev_manifest> short-circuits the walk: if the receiver already has prev_manifest's closure, only the set difference is fetched. Implementation: compute prev_manifest's closure separately (or maintain a persistent reachable-hashes set per peer), subtract from current closure.

The set difference is typically tiny for incremental federation — a single Track append touches O(log N) Objects (new Bucket(s), updated Index Pages on the path, the Manifest itself). 10B-scale Timelines incremental sync is ~kB-MB per update.

8. Storage and bandwidth at 10B-scale

Worked example. 10 mirror sites for a 1B-item Timeline.

  • Per-mirror storage: 8 GB (QINCo-compressed; per spec/0010 §10). Trivial.
  • Initial full sync: 8 GB × 10 mirrors = 80 GB outbound from primary. At 1 Gbps, ~10 hours; at 10 Gbps, ~1 hour. One-time cost.
  • Steady-state delta: ~10 MB per Manifest publish (Track Object + Index Pages + new Bucket(s)). 10 mirrors × 10 MB = 100 MB per publish.
  • Federation Manifest itself: ~100 bytes × len(shards). Negligible.

Worked example. Sharded Timeline, 10 hybrid-range shards, 1B items each, 10B total.

  • Per-shard storage: 8 GB. 10 shards total = 80 GB.
  • No replication; if a shard goes down, that anchor range is unreachable. Mitigation: combine shard+mirror (each shard has 2 mirrors → 20 backends, ~160 GB).
  • Query bandwidth: 10 concurrent scatter requests, each ~512 KB of buckets fetched per query (QINCo-compressed). Total ~5 MB egress from the federation per query — well within ROI for 10B-item queries.

9. Failure modes & operational discipline

FailureDetectionResponse
Shard backend unreachableTCP timeout; per-shard deadlineMark shard failed; partial result if quorum still met
Shard returns wrong bytes for known hashHash verify step (§5.1)Critical: abort op, log, optionally blacklist
Federation Manifest stale (Federation Ref advanced)Optional periodic poll of Federation Ref (analog of per-backend Ref polling)Refresh; queue queries against new Manifest
Closure walk discovers Object not at source backendGET 404Critical for "push" (source is broken); soft for "pull from mirror" (try another shard backend)
Capability token expired mid-operationPer-request validationAbort current op; surface to caller; renew token
Mirror diverges (different bytes for same Manifest hash)Impossible by hash; if observed → backend corruptionTreat as §5.1 critical

The single hardest operational failure is the cross-backend Ref drift case: a Federation Ref advances to a Manifest hash that not every mirror has fully synced yet. SDKs handle this by falling back to lower hashes in the Federation Manifest's parent chain until they find one every queried mirror can resolve — bounded by the parent-chain depth and fetch_attempts budget.

10. Out of scope for v0.X

  • Strongly-consistent cross-backend writes. Federation Refs designate one authoritative backend per ref; multi-backend CAS is not provided. (Deferred indefinitely; operators wanting consensus layer it on top.)
  • Automatic shard rebalancing. When a hybrid-range shard outgrows its backend's capacity, the operator splits manually (publishes a new Federation Manifest with two narrower-range shards in place of the old one). Auto-rebalance defers to v0.X+1.
  • Read-repair / anti-entropy. If two mirrors disagree on the closure of a shared Manifest, the receiver detects via hash verification (§5.1) but does not auto-reconcile.
  • Geo-aware query routing. Each shards[i].backends[j] is an ordered list; the SDK tries backends[0] first. Smarter routing (latency-aware, jurisdiction-aware) is operator-layer.
  • Manifest streaming. Closure walks are batch operations. A future streaming pull (continuous tail-of-history) is interesting; defers to spec/0014.

11. Open questions

  • OQ-48 (→ this spec): Should the Federation Manifest's default_quorum be a numerator/denominator pair (allowing partial: true ⇒ 7/10 shards semantics)? Currently a single uint, which forces "majority" or "any-N" patterns. Decide after first multi-shard deployment.
  • OQ-49 (→ spec/0009): Conformance test vectors for the federate verb: walk correctness, hash-verify abort path, idempotent retry, partial-result semantics. Defer to spec/0009 amendment.
  • OQ-50 (→ this spec): Federation-level GC. Per-backend GC (spec/0006 §7.3) is per-backend; a Federation Manifest's "reachable set" needs to be communicated to every mirror's GC. Naïve: each mirror computes its own reachable set from its local Federation Manifest copies. Open question: does that miss anything in practice? Likely no, but warrants a test.
  • OQ-51 (→ spec/0013): Cross-shard vector index for vector ANN at federation level. The scatter-gather model (§6.2) costs N × per-shard-query work; an alternative is a global graph index (spec/0013) that routes queries to specific shards. Open until spec/0013 lands.
  • OQ-52 (→ this spec): Mirror-aware backends. Should a backend natively advertise "I am a mirror of X" via a well-known endpoint so receivers can short-circuit federation discovery? Probably yes; design deferred.

Next: spec/0013 (graph-based ANN) — addresses OQ-51 above and provides the global routing layer that scatter-gather assumes the application provides.