Spec 0012 — Federation & Cross-Cluster Queries
Status: Draft (Phase 3 design).
Depends on: spec/0001, spec/0002, spec/0005, spec/0008.
Motivation: spec/0008 §8.3 acknowledges that a Manifest's content closure can be served from any backend that holds it but defers a protocol-level replication primitive (OQ-40). spec/0008 §10 also explicitly punts "Cross-Space federation primitives" to v0.1+. At billion-scale, single-backend DreamDB already works (ref-implementation hits ~1B vectors per Track). Ten-billion-scale demands cross-backend operation by definition: one S3 region's tail latencies, capacity ceilings, and blast radius force operators to shard across regions/clouds/clusters. This spec defines the on-wire and operational primitives that make cross-backend DreamDB coherent rather than a pile of mutually-opaque single-cluster deployments.
1. Purpose
Federation is the discipline of treating N backends as one logical DreamDB without:
- Centralizing the data plane (would break the "lock-free data plane" principle from spec/0000 §5.2).
- Inventing a new naming layer (addresses remain
<timeline>/<modality>/...; the timeline IS the unit of federation). - Requiring transactional coordination across backends (impossible at WAN scale; spec/0005 already states eventual consistency for refs).
By the end of this document the following are concrete:
- The federation models: mirror (whole-Timeline replication), shard (per-Timeline partitioning), hybrid (mirror + shard).
- The Federation Manifest: a new ObjectKind that points at component Manifests across backends. Manifest-of-Manifests for cross-backend queries.
- The
federateverb: protocol-level primitive to push or pull a Manifest's transitive closure between backends. Resolves OQ-40. - The cross-cluster query path: scatter-gather over Federation-Manifest children with bounded fan-out and partial-result semantics.
- The trust & capability model: read-only by default, capability tokens for write-through, per-backend hash verification mandatory.
- Storage and bandwidth budgets at 10B scale.
What stays defined elsewhere:
- The address grammar — spec/0002.
- Per-Timeline content addressing & immutability — spec/0001, spec/0002.
- The HTTP backend contract — spec/0005.
- Manifest DAG and refs — spec/0008.
What this document does NOT define:
- Cross-backend transactions. Federation is eventually consistent across backends; spec/0008's CAS guarantee remains a per-backend property.
- Quorum-based consistency models (Paxos, Raft). Operators wanting strong cross-backend consistency layer their own consensus on top of single-backend DreamDB.
- Service discovery / registry. How a client learns "backend B exists" is operator-coordinated — DNS, a config file, a service mesh.
- Per-modality cross-backend indexing (e.g. a global vector index spanning N backends). Deferred to spec/0013 (graph-based ANN with cross-shard routing).
2. Federation models
Three operational patterns. A deployment MAY combine them per-Timeline.
2.1 Mirror
The same Timeline T is published to backends A and B. Both hold the transitive closure of T's latest Manifest. Read queries route to either backend; write coordination is operator-defined (one-writer per Ref is the simplest; spec/0008 §6 covers multi-writer rebase).
Use case: HA / DR; serving reads from the geographically closest backend.
Storage cost: 2× (or N× for N mirrors). Bandwidth cost: one-time closure transfer per Manifest publish; ongoing delta transfer per write.
2.2 Shard
Different Timelines live in different backends. T_a in backend A, T_b in backend B, T_c in backend C. A cross-backend query spans the Federation Manifest that points at all three.
Use case: capacity scaling (one S3 region too small or too expensive); jurisdictional sharding (EU data in EU bucket); per-tenant isolation.
Storage cost: 1× — each Timeline lives in exactly one backend. Bandwidth cost: zero replication; per-query scatter-gather instead.
2.3 Hybrid
A Timeline is sharded by anchor range or by Track. Backend A holds Tracks 0–999; backend B holds Tracks 1000–1999. The Federation Manifest carries the sharding rule.
Use case: ultra-large Timelines (>10B Items in one logical stream) where mirror is too expensive and shard-by-Timeline loses cross-Timeline query coherence.
Storage cost: 1×. Bandwidth cost: per-query scatter-gather, partial — only the shard(s) covering the query range participate.
3. The Federation Manifest
A Federation Manifest is a new ObjectKind that names component Manifests living in distinct backends. It is content-addressed like any other Manifest; its hash is independent of the backends that resolve it.
Address path:
(New top-level namespace, parallel to manifests/. Federation Manifests are themselves NOT bound to a single backend; the same Federation Manifest hash MAY be served from any backend that holds the bytes.)
3.1 CBOR encoding
parents, like the Manifest DAG in spec/0008, supports atomic federation-level commits: a new Federation Manifest pointing at a different shard composition is the equivalent of an in-cluster Publish. Federation Refs (§4.3) advance via CAS exactly like per-backend Refs.
3.2 Shard-key sub-Object
For shard_role = "shard" (per-Timeline shard) or "hybrid-range" (within-Timeline anchor-range shard):
For "mirror" — shard_key is null; all mirrors hold identical content.
3.3 Why a separate ObjectKind (not just a Manifest with extra fields)
Federation Manifests differ from per-backend Manifests in three load-bearing ways:
- They reference external backends. Per-backend Manifests resolve every hash within their own backend; Federation Manifests reference backends by URL.
- They carry partial-result semantics.
default_quorumlets a query succeed when some shards are unavailable; per-backend Manifests have no such notion (a missing Object is always a hard error). - They carry trust metadata. Per-backend Manifests assume the local backend is trusted; Federation Manifests must validate hashes from third-party backends explicitly (§5).
Folding these into per-backend Manifests would force every SDK to handle URL-resolution, quorum, and capability tokens even for single-backend deployments — exactly the complexity spec/0000 §5.2 says to avoid.
3.4 Federation Manifests as Manifests (recursion)
A Federation Manifest MAY reference another Federation Manifest in its shards[i].manifest_hash slot (provided the referenced Object's path indicates federation-manifests/). This permits tree-of-trees composition: a "global" Federation Manifest covers continents, each continent's manifest covers regions, each region's manifest covers backends. Recursion depth is bounded by the SDK at fetch time (default cap: 4; configurable).
4. The federate verb
Resolves OQ-40. A protocol-level primitive that copies a Manifest's transitive closure from a source backend to a destination backend.
4.1 Push mode
The source backend's operator initiates:
The destination MUST:
- Validate the capability token against its local policy.
- Walk the source Manifest's transitive closure (Track Objects, Bucket Objects, VS Objects, Index Pages, etc.) by fetching from
source_backend. - For every Object fetched, verify its content hash matches its path before storing locally. A path/byte mismatch is a critical error: the destination MUST refuse and abort the push.
- PUT each verified Object to its local backend at its canonical path. PUTs are idempotent by content-address; collisions are no-ops.
- If
include_refswas specified, create/update those refs locally to point at the freshly federatedmanifest_hash(CAS as usual; failure surfaces). - Respond with a summary: count of Objects transferred, count already-present, total bytes.
The push is at-least-once — if it crashes mid-walk and is retried, the idempotent PUTs make repetition safe.
4.2 Pull mode
The destination backend's operator initiates:
Response: ordered list of Object hashes the destination needs. Destination then fetches each from source_backend (existing per-Object GET path; spec/0005). Same hash-verify-before-store discipline as §4.1 step 3.
Pull mode is the typical pattern for receive-side-driven federation (mirror sites, archive nodes, low-trust peers). The source backend does not need write access to the destination; only the destination needs read access to the source.
4.3 Federation Refs
A Federation Ref is the cross-backend analog of a per-backend Ref (spec/0008 §4). Path:
It points at a Federation Manifest hash. Advancement via CAS: the destination provides an expected_etag, the source-of-truth backend (typically a designated coordinator backend, but any one of the federation participants can play this role) performs the CAS as if it were a normal Ref update (spec/0005 §5.3).
Critically, only ONE backend in the federation holds the authoritative Federation Ref. Mirror backends pull from it; they do not race to advance it. This avoids cross-backend CAS, which is intractable at WAN scale.
5. Trust & capability model
Federation crosses administrative boundaries. The protocol assumes mutual suspicion:
5.1 Hash verification is mandatory
For every Object fetched from a non-local backend, the receiver MUST verify the BLAKE3 hash matches the path's <hash> segment before storing or using the bytes. spec/0002's "the path IS the hash" guarantee makes this O(bytes) and unavoidable.
A backend serving a different byte sequence under a known hash is provably misbehaving. The receiver MUST log, abort the current operation, and (operator-defined) blacklist the source.
5.2 Capability tokens
shards[i].trust and federate verb requests carry a capability token. The minimal format:
signature algorithm: Ed25519 mandatory (the only required algorithm). The signing pubkey is published by issuer at a well-known path (<issuer>/.well-known/dreamdb/federation-pubkey); receivers cache it.
Multi-tenant deployments extend this format with tenant_id and scope_path fields (per 0018 §3). The base shape above is the federation-only minimum; the extended shape is REQUIRED when the issuer is a multi-tenant operator. Verifiers MUST accept the extended shape via the map-extensibility rule (0002 §3.1.3).
A backend MUST refuse a federation operation if the capability token's signature doesn't verify OR the token is expired. Read-only operations MAY proceed without a token if the source backend's policy permits anonymous reads (e.g. public datasets).
5.3 Out of scope for v0.X federation auth
- OAuth2 / OIDC integration. Capability tokens are intentionally minimal; integration with external IdPs is operator-layer concern.
- Per-Object ACLs. DreamDB's content-addressed model gives Object granularity at the path level; finer-grained access control is a non-goal.
- Encryption at rest / in transit. spec/0005 mandates HTTPS for the wire; per-Object encryption is application-layer.
- Revocation. Capability tokens have an expiry; short-lived tokens (minutes-to-hours) are the recommended pattern. Long-lived revocation is operator-layer.
6. Cross-cluster query semantics
A query against a Federation Manifest evaluates each shards[i] and merges results.
6.1 Scatter-gather
If fewer than fm.default_quorum shards return successfully, the query returns a partial result with a partial: true flag and the list of failed shards. The application decides whether to retry or accept partial.
6.2 Vector queries across shards
For vector ANN, scatter-gather with rank merge requires care: each shard returns its local top-K, and the merge step keeps the global top-K by score. This is mathematically correct ONLY if each shard's local top-K contains the global top-K's members from that shard — i.e., only if every shard with non-zero candidates returns at least K results. Sparse shards (where the query has fewer than K candidates) fall out of this guarantee.
Practical mitigation: query each shard for top-K × oversample where oversample ≥ 4. Empirically this closes the residual recall gap to <1% for typical embedding distributions.
6.3 Time-range queries across shards
For time-range queries on hybrid-range shards, the anchor_range field allows the query planner to prune shards that don't intersect the query's time window. This is a major efficiency win at 10B-scale where most queries touch a small time window.
6.4 Latency budget at 10B-scale
Example: 10 shards, each holding 1B vectors, dim=768, IMI+QINCo (per spec/0010), tables=1, hot caches everywhere.
Per-shard latency (per spec/0004 §7.3 plus QINCo from spec/0010 §10): ~30–80 ms p50.
Scatter-gather fan-out is concurrent over HTTP/2. The cross-cluster latency is bounded by the slowest shard, plus inter-shard rank merge (negligible, <5 ms for K_max=10000).
End-to-end p50: ~40–100 ms — essentially the same as single-backend, because the shards process in parallel. The p99 increases proportionally to fan-out ((1 - (1-p99)^N) ≈ N × p99 for small p99), so at N=10 the p99 climbs from ~150 ms (single) to ~500–800 ms. Operators mitigate via per-shard hedged requests, partial-result fallback, or capacity overprovisioning.
7. Closure walk algorithm
Both push and pull modes need to enumerate every Object reachable from a Manifest. The receiver computes the set; the source serves bytes on demand.
7.1 Walk
scan_outbound_hashes is the union of:
- Manifest → Track Objects, registry SpatialIndex / ScalarIndex / VectorCompressor hashes, parent Manifest hashes.
- Track Object → Bucket Objects, Index Pages, VS Objects.
- Index Page → Bucket / VS Objects, child Index Pages.
- Bucket / VS Object → no further outbound (leaf).
- Federation Manifest → component Manifests; recurse with depth cap (§3.4).
7.2 Delta walks
since: <prev_manifest> short-circuits the walk: if the receiver already has prev_manifest's closure, only the set difference is fetched. Implementation: compute prev_manifest's closure separately (or maintain a persistent reachable-hashes set per peer), subtract from current closure.
The set difference is typically tiny for incremental federation — a single Track append touches O(log N) Objects (new Bucket(s), updated Index Pages on the path, the Manifest itself). 10B-scale Timelines incremental sync is ~kB-MB per update.
8. Storage and bandwidth at 10B-scale
Worked example. 10 mirror sites for a 1B-item Timeline.
- Per-mirror storage: 8 GB (QINCo-compressed; per spec/0010 §10). Trivial.
- Initial full sync: 8 GB × 10 mirrors = 80 GB outbound from primary. At 1 Gbps, ~10 hours; at 10 Gbps, ~1 hour. One-time cost.
- Steady-state delta: ~10 MB per Manifest publish (Track Object + Index Pages + new Bucket(s)). 10 mirrors × 10 MB = 100 MB per publish.
- Federation Manifest itself: ~100 bytes × len(shards). Negligible.
Worked example. Sharded Timeline, 10 hybrid-range shards, 1B items each, 10B total.
- Per-shard storage: 8 GB. 10 shards total = 80 GB.
- No replication; if a shard goes down, that anchor range is unreachable. Mitigation: combine shard+mirror (each shard has 2 mirrors → 20 backends, ~160 GB).
- Query bandwidth: 10 concurrent scatter requests, each ~512 KB of buckets fetched per query (QINCo-compressed). Total ~5 MB egress from the federation per query — well within ROI for 10B-item queries.
9. Failure modes & operational discipline
| Failure | Detection | Response |
|---|---|---|
| Shard backend unreachable | TCP timeout; per-shard deadline | Mark shard failed; partial result if quorum still met |
| Shard returns wrong bytes for known hash | Hash verify step (§5.1) | Critical: abort op, log, optionally blacklist |
| Federation Manifest stale (Federation Ref advanced) | Optional periodic poll of Federation Ref (analog of per-backend Ref polling) | Refresh; queue queries against new Manifest |
| Closure walk discovers Object not at source backend | GET 404 | Critical for "push" (source is broken); soft for "pull from mirror" (try another shard backend) |
| Capability token expired mid-operation | Per-request validation | Abort current op; surface to caller; renew token |
| Mirror diverges (different bytes for same Manifest hash) | Impossible by hash; if observed → backend corruption | Treat as §5.1 critical |
The single hardest operational failure is the cross-backend Ref drift case: a Federation Ref advances to a Manifest hash that not every mirror has fully synced yet. SDKs handle this by falling back to lower hashes in the Federation Manifest's parent chain until they find one every queried mirror can resolve — bounded by the parent-chain depth and fetch_attempts budget.
10. Out of scope for v0.X
- Strongly-consistent cross-backend writes. Federation Refs designate one authoritative backend per ref; multi-backend CAS is not provided. (Deferred indefinitely; operators wanting consensus layer it on top.)
- Automatic shard rebalancing. When a hybrid-range shard outgrows its backend's capacity, the operator splits manually (publishes a new Federation Manifest with two narrower-range shards in place of the old one). Auto-rebalance defers to v0.X+1.
- Read-repair / anti-entropy. If two mirrors disagree on the closure of a shared Manifest, the receiver detects via hash verification (§5.1) but does not auto-reconcile.
- Geo-aware query routing. Each
shards[i].backends[j]is an ordered list; the SDK triesbackends[0]first. Smarter routing (latency-aware, jurisdiction-aware) is operator-layer. - Manifest streaming. Closure walks are batch operations. A future streaming pull (continuous tail-of-history) is interesting; defers to spec/0014.
11. Open questions
- OQ-48 (→ this spec): Should the Federation Manifest's
default_quorumbe a numerator/denominator pair (allowingpartial: true ⇒ 7/10 shardssemantics)? Currently a single uint, which forces "majority" or "any-N" patterns. Decide after first multi-shard deployment. - OQ-49 (→ spec/0009): Conformance test vectors for the
federateverb: walk correctness, hash-verify abort path, idempotent retry, partial-result semantics. Defer to spec/0009 amendment. - OQ-50 (→ this spec): Federation-level GC. Per-backend GC (spec/0006 §7.3) is per-backend; a Federation Manifest's "reachable set" needs to be communicated to every mirror's GC. Naïve: each mirror computes its own reachable set from its local Federation Manifest copies. Open question: does that miss anything in practice? Likely no, but warrants a test.
- OQ-51 (→ spec/0013): Cross-shard vector index for vector ANN at federation level. The scatter-gather model (§6.2) costs
N × per-shard-querywork; an alternative is a global graph index (spec/0013) that routes queries to specific shards. Open until spec/0013 lands. - OQ-52 (→ this spec): Mirror-aware backends. Should a backend natively advertise "I am a mirror of X" via a well-known endpoint so receivers can short-circuit federation discovery? Probably yes; design deferred.
Next: spec/0013 (graph-based ANN) — addresses OQ-51 above and provides the global routing layer that scatter-gather assumes the application provides.