DreamDBv0.2.0bec026

Spec 0013 — Graph-Based ANN (Vamana / DiskANN family)

Status: Draft (Phase 3 design). Depends on: spec/0001, spec/0002, spec/0004, spec/0007, spec/0010, spec/0012. Motivation: spec/0004's partition algorithms (LSH/IVF/IMI) all share a per-query cost structure: O(cells_touched × items_per_cell) byte fetch. At billion-scale this is fine; at ten-billion-plus, the cell-touched budget grows faster than the recall floor allows, and the partition-based approach loses to graph-based approaches in the recall/cost ratio. Microsoft's DiskANN (Subramanya et al. 2019) and Vamana (the underlying graph algorithm) measure ~3–5× better recall@cost at 1B–10B scale than IVF-PQ. spec/0013 adds graph-based ANN as a peer of the partition-based algorithms — same SpatialIndex registry, same composable-with-spec/0010-compression contract — and additionally resolves spec/0012 OQ-51 by defining a global routing layer for cross-shard graph queries.


1. Purpose

DreamDB's spec/0004 partition algorithms answer "which cells does this query touch?" Graph algorithms answer a different question: "starting from a known entry point, which sequence of greedy local steps converges to the nearest neighbors?" The two are independently optimal — partition for selectivity-first, graph for recall-floor-first.

By the end of this document the following are concrete:

  • The graph-based SpatialIndex contract: an extension of spec/0004 §2 covering the entry-point + adjacency-list data model.
  • The GraphIndex Object: immutable, content-addressed CBOR Object that distributes the graph's metadata (R, alpha, entry point, page layout, seed).
  • The GraphPage Object kind: new on-disk format that holds a fixed-size batch of (node, vector, adjacency-list) tuples, page-aligned for I/O efficiency.
  • The Vamana algorithm (dreamdb.vamana-cosine): deterministic build + greedy search, suitable for 10B-scale.
  • The streaming-vs-batch build policy: how the graph is materialized at compaction time and (optionally) maintained incrementally.
  • The cross-shard routing layer: a "router GraphIndex" over shards, resolving spec/0012 §6.2's scatter-gather recall floor and OQ-51.
  • The compression interop: how dreamdb.vamana-cosine composes with dreamdb.qinco-cosine (spec/0010) to ship a "graph over compressed vectors" production deployment.

What stays defined elsewhere:

  • SpatialIndex Object format / registry — spec/0004 §3.
  • Bucket-internal vector compression — spec/0010.
  • Federation Manifest format — spec/0012 §3.
  • HTTP backend contract — spec/0005.

What this document does NOT define:

  • Learned graph algorithms (NGT/PANNG, learned-edge-prune networks). Defer to v0.X+1 after Vamana ships.
  • Hybrid graph+partition (HNSW-style hierarchical graphs). Vamana is a single-layer R-regular graph; hierarchy is a future addition.
  • Per-query graph mutation (online graph update as a side-effect of queries). DreamDB's content layer is strictly immutable; graph updates produce new GraphPage Objects (§5).

2. Graph-SpatialIndex contract (extends spec/0004 §2)

A graph-based SpatialIndex differs from a partition-based one in that it does NOT produce a deterministic spatial key. Instead it produces a search trajectory: a sequence of (node-id, partial-distance) tuples that the SDK consumes as it walks the graph.

GraphSpatialIndex.search(query, k) →
  Stream<(node_id, vector_bytes, score)>     ;; yielded in approximate score order

The contract:

  • Determinism (§2.1 of spec/0004): for the same query, the same GraphIndex Object, the same fetched GraphPage bytes, two implementations MUST yield the same trajectory in the same order.
  • Distributability (§2.2 of spec/0004): the SDK MUST be able to reproduce the search from the GraphIndex Object + the GraphPage Objects alone.
  • Locality: greedy local steps MUST converge toward the global nearest neighbors. Vamana's α-pruning property gives a provable convergence bound (Subramanya et al. 2019, Thm 1); we restate the practical implication in §4.

Partition-based and graph-based algorithms register under the same algorithm registry (spec/0004 §3.4). A modality declares one or the other in its registry entry; the SDK dispatches accordingly.

3. The GraphIndex Object

Address path:

graph-index/<multihash-of-canonical-CBOR-bytes>

(New top-level namespace, parallel to spatial-index/, scalar-index/, vector-compressor/, federation-manifests/.)

3.1 CBOR encoding

{
  "algorithm":     "<algorithm-id>",            ;; "dreamdb.vamana-cosine"
  "dim":           <unsigned int>,              ;; vector dimensionality
  "metric":        "<metric-id>",               ;; "cosine"
  "params":        <algorithm-specific>,        ;; §4 for Vamana
  "graph_layout": {
    "R":                  <unsigned int>,       ;; max out-degree (typical 32–64)
    "page_node_count":    <unsigned int>,       ;; nodes per GraphPage Object
    "page_bytes_target":  <unsigned int>,       ;; informational; ~64 KiB target
    "entry_point":        <unsigned int>,       ;; node-id of the search starting point
    "node_count":         <unsigned int>,       ;; total nodes in graph
    "vector_layout": {                          ;; how vectors appear inside GraphPages
      "compressor_hash":  <multihash | null>,   ;; null ⇒ raw f32; non-null ⇒ spec/0010 compressor
      "record_bytes":     <unsigned int>,       ;; per-node vector record size
    },
  },
  "shard_routing": <sub-Object | null>,         ;; (§6) non-null ⇒ this is a router GraphIndex
}

3.2 Reference from the Manifest registry

A modality using graph-based ANN declares it in the registry, parallel to spec/0004 §3.3:

"registry": {
  "embedding.f32.dim=768.graph.R=64": {
    "kind":              "continuous",
    "object_kind":       "graph-page",                    ;; NEW (this spec)
    "algorithm":         "dreamdb.vamana-cosine",
    "graph_index":       <multihash-GraphIndex-Object>,   ;; replaces spatial_index for graph-based
    "vector_compressor": <multihash-VC-Object | null>,    ;; optional, per spec/0010
    "rerank_storage":    <multihash-VS-Object | null>,    ;; optional re-rank path
  }
}

graph_index and spatial_index (spec/0004) are mutually exclusive in a single registry entry. A modality is partition-based OR graph-based, not both.

3.3 Modality-string parameterization

embedding.f32.dim=<D>.graph.R=<R>[.compress=<algo>:<param>]

The .graph token replaces .bucketed.spatial-bits=<N>. R is the graph's max out-degree. Compression is optional and composes per spec/0010.

Example:

  • embedding.f32.dim=768.graph.R=64.compress=qinco:M=8 — DiskANN-style: Vamana graph over QINCo-compressed vectors.

3.4 Registry-vs-GraphIndex consistency (mandatory validation)

The SDK MUST validate at load time, parallel to spec/0004 §3.3.1 and spec/0010 §3.3:

  1. Algorithm match: GraphIndex Object's algorithm equals registry algorithm.
  2. Dimensionality / R consistency: GraphIndex Object's dim and graph_layout.R match modality parameters.
  3. Vector layout: if vector_compressor is set in registry, GraphIndex's vector_layout.compressor_hash MUST equal it.
  4. Entry-point validity: entry_point < node_count.

Mismatch ⇒ reject Manifest as malformed (ManifestCorrupted).

4. dreamdb.vamana-cosine

The flagship graph algorithm. Vamana (Subramanya et al. 2019) is a directed R-regular graph with α-pruned edge selection: starting from a random graph, two passes refine edges so that greedy search converges quickly.

4.1 Params (CBOR)

{
  "version":       1,
  "alpha":         <f32-as-bytes>,              ;; α-pruning factor (default 1.2)
  "build_seed":    <32 bytes>,                  ;; RNG seed for deterministic build
  "L_build":       <uint>,                      ;; search-list size during build (typical 100)
  "L_search":      <uint>,                      ;; default search-list size at query time (typical 100)
  "build_passes":  <uint>,                      ;; typically 2; 1 produces lower-quality graphs
}

4.2 GraphPage Object format

Each GraphPage holds page_node_count nodes. Layout:

┌──────────────────────────────────────────────────────────┐
│ GraphPage Header (192 bytes, fixed)                      │
│   magic:               4 bytes  = 0x47504755 ("GPGU")    │
│   version:             u32      = 1                       │
│   page_index:          u32      = 0-based page number     │
│   node_count_in_page:  u32      = nodes in this page      │
│   first_node_id:       u64      = first node-id in page   │
│   R:                   u32      = max out-degree (echo)   │
│   vector_record_bytes: u32      = per-node vector bytes   │
│   graph_index_hash:    33 bytes (multihash of GraphIndex  │
│                         Object — lineage check, like      │
│                         spec/0007 §6.1.1)                 │
│   modality:            32-byte ASCII (zero-padded prefix) │
│   reserved:            remaining bytes = 0                │
├──────────────────────────────────────────────────────────┤
│ Node 0  (variable size; see below)                        │
│   time_anchor:    u64                                     │
│   vector_record:  vector_record_bytes bytes               │
│   adj_count:      u32                                     │
│   adj_list:       adj_count × u64 (node-ids)              │
├──────────────────────────────────────────────────────────┤
│ Node 1                                                    │
│ ...                                                       │
└──────────────────────────────────────────────────────────┘

Variable-size records (adj_count may vary per node) preclude pure positional indexing. The header's node_count_in_page plus a per-page offset table (appended after the last node) MAY be added in a follow-up if measurement shows it's needed; v0.X reads pages whole.

Address path:

<timeline>/<modality>/graph-page/<page-content-hash>

A node-id maps to (page-index, in-page-offset). The SDK consults the GraphIndex Object's graph_layout.page_node_count to compute the page-index for any node-id; the in-page-offset is found by scanning the page header forward (cheap — pages are typically 64 KiB).

4.3 Build algorithm (compaction-time)

Vamana is built during a special compaction phase (per spec/0006 §7.3's compaction verb). The build is deterministic given:

  • The set of input vectors (their content-addressed identity).
  • The GraphIndex Object's build_seed, L_build, alpha, build_passes.

Procedure:

  1. Random init: every node has R out-edges to randomly chosen other nodes. The randomness is ChaCha20-keyed off build_seed (same primitive as spec/0004 §5.3).
  2. Pass 1 (alpha=1.0): for each node v in deterministic order (sorted by node-id): a. Greedy-search from entry_point (or current pass's entry point) toward v with search-list size L_build. b. The visited set V_v becomes the candidate edge set. c. α-prune: starting from the nearest candidate, add edges greedily; for each candidate c, accept iff no already-accepted edge c' satisfies dist(c, c') × α < dist(v, c). (At α=1.0 this is standard nearest-neighbor pruning.) d. Cap at R out-edges; if fewer than R accepted, fill with the closest unrejected candidates.
  3. Pass 2 (alpha=user-supplied, default 1.2): repeat the pass with the user-supplied α. The diversity-boosting effect of α > 1 gives better recall at high L_search.
  4. Materialize GraphPages: assign node-ids to pages in node-id order; emit each GraphPage Object.

Determinism notes:

  • entry_point MUST be deterministic. The convention: pick the medoid of the first 10,000 vectors (or all vectors if fewer); ties broken by smallest node-id. Approximate medoid (cheap to compute) suffices; full medoid is too expensive at 10B.
  • All inner-loop floating-point operations follow spec/0004 §5.4.1 discipline (scalar reference path, no FMA, no -ffast-math).
  • The build is single-threaded by spec. Parallel build is permitted ONLY if it produces bit-identical results (typically requires careful work-stealing with deterministic seed-propagation). Implementations that can't guarantee bit-identical parallel build MUST fall back to single-thread for conformance test vectors.

4.4 Search algorithm

At query time, given query vector q and target k:

visited = {entry_point}
candidates = MinHeap[(dist(q, entry_point), entry_point)]
results = MaxHeap[max-size=L_search]
while candidates not empty:
  (d, v) = candidates.pop_min()
  if d > results.max_value() and len(results) == L_search: break
  v_page = fetch GraphPage covering v          ;; cached; spec/0005 §3.3
  for adj_node in v_page.adj_list_for(v):
    if adj_node in visited: continue
    visited.add(adj_node)
    adj_dist = dist(q, vector_for(adj_node))   ;; uses ADC (spec/0010 §5.1) if compressed
    candidates.push((adj_dist, adj_node))
    results.push_if_better((adj_dist, adj_node))
return results.top_k()

Per spec/0010 interop: when the GraphIndex references a vector_compressor, dist(q, vector_for(adj_node)) is computed via the ADC lookup table built once at query start.

Latency profile (1B-scale, R=64, compressed with QINCo M=8, L_search=100):

  • ~150–300 hops per query (Vamana on 1B-scale; varies with α).
  • Each hop fetches one GraphPage (~64 KiB) — most are warm-cache after the first ~50 hops cover the relevant graph region.
  • Cold-start: ~50 cold GETs × ~50 ms over WAN = unacceptably slow.
  • Warm-cache: ~250 hops × ~50 µs of local memory access + ~50 µs ADC scoring = ~25 ms per query.

The cold-start latency is why a hot in-memory cache of the entry-point region is mandatory for production deployments. The SDK is RECOMMENDED to pre-fetch and pin the GraphPages within K hops of entry_point at session start (K=3 typical).

4.5 Storage cost

For 1B nodes, R=64, dim=768 with QINCo M=8 compression:

  • Per node: 8 (anchor) + 8 (compressed vector) + 4 (adj_count) + 64×8 (adj_list) = 532 bytes.
  • Per page (256 nodes): 256 × 532 = ~133 KB raw, ~64 KiB after typical compression (zstd at REST is operator-layer).
  • Total storage: 1B × 532 B = 532 GB.

Compared to spec/0010 §10's QINCo IVF: 8 GB for the same dataset. Graph-based storage is ~66× larger. The trade is paid for in recall: Vamana achieves ~0.99 recall@10 at L_search=100, vs IVF-QINCo at ~0.97 with re-rank.

For 10B-scale where recall floor matters more than storage absolute, graph wins. For <1B-scale or storage-constrained deployments, IVF+QINCo is still the right answer.

5. Incremental updates (deferred; design sketch)

DreamDB's content layer is strictly immutable. A streaming-update FreshDiskANN-style mechanism would produce new GraphPage Objects per insertion, rewriting affected pages. This is expensive (~kB-MB write amplification per insertion) but tractable.

v0.X DEFERS streaming-graph updates. The recommended pattern:

  1. Continuous ingest goes to a partition-indexed Track (spec/0004 IVF or IMI, possibly with spec/0010 compression).
  2. Periodic compaction (operator-scheduled, e.g. nightly or per-million-items) builds a graph-indexed Layer Track (spec/0008 Layer pattern) over the same underlying VS Object.
  3. Queries that need the graph's recall floor target the graph-Layer Track; queries that need fresh data target the partition Track. The application chooses per use case.

Streaming graph update (dreamdb.fresh-vamana-cosine?) is a v0.X+1 follow-up.

6. Cross-shard routing (resolves spec/0012 OQ-51)

The scatter-gather model in spec/0012 §6.2 incurs O(N_shards × per-shard-query) work. For vector queries at extreme scale (10B+ across 100+ shards), this dominates. A graph-based router resolves it.

6.1 The router GraphIndex

A "router" GraphIndex is a normal GraphIndex (per §3) whose shard_routing field is non-null:

"shard_routing": {
  "shards": [
    {
      "shard_index":     <uint>,                ;; index into the parent Federation Manifest's shards[]
      "centroid_vector": <bytes>,               ;; representative point for routing
      "shard_size_log2": <u8>,                  ;; informational; for query planner
    },
    …
  ]
}

The router graph's nodes are per-shard centroids (or representative samples) — typically thousands of nodes covering hundreds of shards. The Federation Manifest references the router GraphIndex via a new optional field:

;; Extension to spec/0012 §3.1 Federation Manifest
"shard_router": <multihash-of-router-GraphIndex | null>,

6.2 Router-driven query path

1. Compute query vector q's compressed form (using the federation-level VectorCompressor if any).
2. Greedy-search the router GraphIndex toward q to identify top K_router shards (typically K_router = 4–8).
3. Scatter queries to those K_router shards ONLY (not all N).
4. Gather and merge per spec/0012 §6.1.

This reduces fan-out from N to K_router, typically 10–30× cost reduction at 100-shard federations. The router is built from per-shard summaries (centroids or coreset samples) at federation-publish time.

6.3 Router build and refresh

The router is built by the federation operator's coordinator backend. Inputs: each shard's centroid or coreset, fetched once per federation refresh cycle. Build is the standard §4.3 algorithm at much smaller scale (typical: 10³–10⁴ nodes).

Refresh cadence is operator-defined; recommended every ~6 hours or when shard set changes by >1%. The router GraphIndex is content-addressed like any other; a federation publish that changes the router emits a new Federation Manifest with the new router hash.

7. Determinism conformance

The combination of f32 left-fold + scalar-reference (spec/0004 §5.4.1) + neural-MLP determinism (spec/0010 §7.3.1) covers most of what Vamana needs. Additional graph-specific requirements:

  • Build node order: ascending node-id. No exceptions; even parallel implementations process completion in this order.
  • Tiebreak in greedy search: when two candidates have identical distance, the smaller node-id wins. Ties at scale are rare but must be deterministic.
  • Random tiebreak in α-pruning: when α-prune evaluation has identical results for two candidates, the smaller node-id wins. (Not a random tiebreak — deterministic.)

The conformance suite (spec/0009 §5.X) ships a small reference graph (dim=64, 10K nodes, R=8) plus build and search test vectors.

8. Storage and latency at 10B-scale

Worked example. 10 backends, 1B vectors each, dim=768, QINCo M=8 + Vamana R=64.

  • Per-backend graph storage: 532 GB (per §4.5).
  • Total federation storage: 5.32 TB.
  • Cross-shard router: 10⁴ nodes × ~500 B = ~5 MB. Negligible.

Latency:

  • Cold-start at scale: ~30 ms for router-graph greedy (10⁴ nodes, all warm if router is pinned) + ~30 ms × K_router shards scatter (parallel) + ~30 ms gather/rerank = ~90 ms p50.
  • Warm-cache: ~25 ms total — the router fan-out reduction means we only contact 4–8 shards instead of 10.

Compared to spec/0012 §6.4 scatter-gather without router: ~40–100 ms p50 single-backend × N=10 = bounded by slowest shard, but p99 grows linearly with N. The router reduces effective N to K_router, dropping p99 proportionally.

9. Out of scope

  • HNSW (hierarchical small-world graphs). Vamana is single-layer; HNSW's multi-layer hierarchy is a different storage/latency tradeoff. Defer to v0.X+1.
  • Learned-edge graphs (NSG/SSG with learned pruning policies). Vamana's α-pruning is closed-form; learned variants require a training pipeline. Defer.
  • Graph-aware deletion. Vamana under v0.X is append-and-rebuild. Tombstone-based deletion is theoretically tractable; defer.
  • Concurrent build. v0.X is single-thread for conformance. A deterministic-parallel build is an open research question; defer.

10. Open questions

  • OQ-53 (→ this spec): GraphPage offset tables. The current variable-size record format requires scanning to find a specific node within a page. Add a per-page positional index? Defer until benchmarks show scan cost dominates.
  • OQ-54 (→ 0016): Streaming graph updates. Vamana doesn't natively support insertion; FreshDiskANN does. Resolved: 0016 §4 ships dreamdb.fresh-vamana-cosine as the streaming variant — append + background consolidation + tombstone deletes. Mid-iteration learned-graph approaches remain a v0.X+1 research item.
  • OQ-55 (→ spec/0009): Conformance test vectors for dreamdb.vamana-cosine: a small reference graph + build/search vectors. Block v0.X release on this.
  • OQ-56 (→ spec/0010): When using graph + compressor, should the re-rank step (spec/0010 §5) re-rank by fetching VS Object records, or by re-running graph search on raw f32 only over the top-K candidates? Probably the former; design defers to first implementation.

Next: spec/0014 — streaming/CMAF extensions for video at scale (or revisit chunking spec).