DreamDB Specification — 0004: Spatial Indexing
Status: Draft. Builds on
0000-overview.md,0001-data-model.md,0002-content-addressing.md, and0003-time-encoding.md. This document defines the spatial-key derivation algorithm — how a feature vector becomes the bit string that occupies the<spatial-key>slot in an address (0002§6.3). It resolves0000OQ-2.
1. Purpose
0000 Principle 3 says: retrieval is probabilistic localization, not scanning. 0002 defined the address slot for the spatial key (<spatial-key> segment, base2-encoded, N bits long). This document fills in the missing piece — the function that maps a feature vector to those N bits — and quantifies the locality property that makes the billion-scale benchmark feasible.
By the end of this document, the following are concrete:
- The SpatialIndex contract: signature and required properties of any function that computes spatial keys for a DreamDB modality.
- The SpatialIndex Object: the immutable, content-addressed byte format that distributes the function across writers and readers, so they all hash queries identically.
- The v0 default algorithm: LSH-cosine (random-hyperplane signature), specified precisely enough that two independent implementations produce bit-identical keys.
- The extensibility path: how future algorithms (PQ-IVF, learned hashing, etc.) plug in without breaking the address grammar or backend contract.
- The recall-widening strategy: prefix truncation as the v0 mechanism for trading bytes-fetched against recall.
- Quantified locality: collision probabilities and recall-vs-cost curves usable for sizing.
What stays as defined elsewhere:
- The
<spatial-key>slot in the address grammar —0002§6.3. - The base2 encoding of N-bit strings —
0002§6.3.2. - The Spatial-Bucket Object kind —
0001§3.1,0002§7.4.2. - Bucket sizing and the billion-scale target —
0000§1,project_billion_scale_time.md.
What this document does not fix:
- The byte format inside Spatial Bucket Objects (packed-array layout, headers, per-vector encoding) —
0007. - The per-modality Object index format (paged vs. inline, index page CBOR shape) — already in
0002§7.3. - Vector encoding standards beyond what the modality tag declares (
embedding.f32.dim=N); the spec is agnostic to which model produced the vectors.
2. The SpatialIndex Contract
A SpatialIndex is a deterministic, distributable function that maps an input vector to an N-bit string:
where D is the input dimensionality (declared by the modality tag, e.g. dim=768) and N is the spatial-bucket-bit-length (declared by the modality tag, e.g. spatial-bits=18).
Every SpatialIndex MUST satisfy three properties:
2.1 Determinism
Two implementations of the same SpatialIndex Object MUST produce bit-identical outputs for identical inputs. No tolerance for f32-vs-f64 drift, library-version differences, or platform-specific math. The conformance test suite (0009) MUST include a battery of input-output vectors that any conformant SpatialIndex implementation reproduces exactly.
This is a hard requirement, not a recommendation. A spatial-index implementation that "almost always agrees" is broken — addresses derived from disagreeing functions silently route to different buckets, and the protocol's content-addressed equality breaks.
2.2 Distributability
A SpatialIndex's full behavior MUST be reproducible from a small (typically KB to low-MB) immutable Object — the SpatialIndex Object (§3). No hidden state, no model registries, no environment variables. Hand the Object to any conformant SDK, and it can hash queries the same way the writer did.
2.3 Locality
Vectors that are "close" under the modality's similarity metric MUST be more likely to share a bit-prefix than vectors that are "far":
(In words: for the metric appropriate to the modality, more-similar pairs are more likely to share longer prefixes than less-similar pairs.)
Concrete locality guarantees are algorithm-specific — see §5 for the v0 default's quantified bound. The contract just says one must be provided.
3. The SpatialIndex Object
A SpatialIndex Object is the immutable, content-addressed Object that distributes a SpatialIndex's full behavior. It is treated like any other DreamDB content-layer Object: deterministic CBOR, BLAKE3-256 hash, idempotent put-by-hash to the backend.
3.1 CBOR encoding
algorithm, dim, bits, metric, params are required. params shape depends on algorithm; readers MUST reject Objects whose params don't match the expected shape for the named algorithm.
parents is OPTIONAL and OMITTED-WHEN-EMPTY: an encoder MUST NOT emit the key when the parents list is empty. This preserves byte-identical encoding for SpatialIndex Objects created before this revision — their content hashes are unchanged. Decoders treat a missing parents field as the empty list, which means the SI has no ancestors (it stands alone, with strict lineage matching per 0007 §6.1.2).
3.2 Address
A SpatialIndex Object lives at the backend key:
(In its own top-level namespace — like manifests/... in 0002 §7.2.3 — because a single SpatialIndex Object may be referenced by multiple Tracks across multiple Timelines.)
3.3 Reference from a Manifest
A modality that uses spatial bucketing (the embedding.* family with bucketed parameter) MUST declare its SpatialIndex Object hash(es) and the algorithm name in the Manifest's registry field (0002 §7.2 / §5.4):
For single-table modalities (tables=1 or absent), spatial_index is a one-element array. For multi-table (§6.2), it carries L distinct SpatialIndex Object hashes — different seeds, otherwise identical parameters.
3.3.1 Registry-vs-SpatialIndex consistency (mandatory validation)
When the SDK fetches the SpatialIndex Object(s) for a modality, it MUST validate that:
- Algorithm match: every fetched SpatialIndex Object's
algorithmfield equals the registry entry'salgorithmvalue. Mismatch → reject Manifest as malformed (ManifestCorruptederror). - Multi-table parameter consistency (when
L > 1): allLSpatialIndex Objects MUST agree on:algorithm(same string)dim(same input dimensionality)bits(same output bit-length)metric(same metric) They MAY differ only inparams(typically theseedfield that distinguishes tables). Any mismatch → reject Manifest as malformed.
- Modality-tag-vs-spec-fields consistency: the modality's parameters (
dim=N,spatial-bits=M, etc.) MUST match the SpatialIndex Object'sdimandbitsfields. A modality declaringdim=768referencing a SpatialIndex withdim=512is malformed.
These checks are mandatory; an SDK that silently accepts inconsistent registries produces silent under-recall (vectors hashed by the wrong projection land in the wrong buckets and are never matched). The spatial_index_hash field in Bucket headers (per 0007 §6.1.1) provides a redundant per-Bucket check; the registry-level check above catches the same class of bug at Manifest-load time, which is earlier and clearer.
The reader, on encountering a Bucket Object address with a given modality, consults the Manifest's registry to find the SpatialIndex Object hashes, fetches the SpatialIndex Objects (L small GETs, cached for the session), validates per the rules above, and uses them to hash queries identically to how the writer hashed Items.
3.4 Algorithm identifier grammar
Built-in algorithms are reserved under the dreamdb. prefix. v0 defines:
| Algorithm ID | Description | Defined in |
|---|---|---|
dreamdb.lsh-cosine | Charikar's signature LSH for cosine metric (seed-derived) | §5 |
dreamdb.ivf-cosine | Inverted-File partitioning via k-means centroids, cosine metric | §5.6 |
dreamdb.imi-cosine | Inverted Multi-Index — two half-codebooks, cosine metric | §5.7 |
dreamdb.lsh-cosine is the v0 default and the only algorithm v0 implementations are REQUIRED to support. dreamdb.ivf-cosine is OPTIONAL but RECOMMENDED for deployments with real-embedding data at ≤10⁵ effective cells. dreamdb.imi-cosine is the RECOMMENDED choice when the effective cell count would otherwise exceed ~10⁵ (typically beyond ~10⁸ vectors at constant cell size) — see §5.7 for the cost-vs-recall tradeoffs.
User-defined algorithms use reverse-DNS (com.example.myco.learned-hash) and MUST be declared in the Manifest's registry with sufficient detail for a reader to either dispatch to a known implementation or reject the Manifest as unsupported.
3.5 SpatialIndex lineage chain
parents declares the SpatialIndex Object(s) that the current SI evolved from. The dreamdb-cli ada-ivf-step and dreamdb-cli rebuild-ivf verbs SHOULD set parents = [<old-si-hash>] on every SI they publish.
parents is informational at the SpatialIndex level — it doesn't change spatial-key derivation — but it carries operational weight at the Bucket level: when an SI is published with non-empty parents, Bucket Objects placed under any ancestor SI MAY be reused under the new SI without rewriting, provided the bucket's centroid position is preserved between the ancestor and the descendant (see 0007 §6.1.2 for the lineage check rules).
3.5.1 Lineage Chain Semantics
A SpatialIndex's "lineage chain" is the transitive closure of parents walked depth-first. Each SI Object in the chain is one historical version of the index.
In this chain, a Bucket Object whose header records spatial_index_hash = SI_1 (the root) is "lineage-valid" under SI_3 IF AND ONLY IF the centroid position it occupies in SI_1 has been preserved (unchanged centroid vector) through SI_2 and SI_3. The chain walk is the necessary condition; the per-cell centroid-preservation check is the sufficient condition.
3.5.2 Chain Length Cap
Implementations SHOULD cap chain walks at 100 ancestors to bound the lineage check cost on long-lived datasets. Beyond the cap, the chain is treated as truncated and strict equality applies (any older Bucket must be rewritten).
3.5.3 Multi-Parent SIs
parents is an array, not a single hash, so an SI MAY descend from multiple ancestors. This is the foundation for future multi-parent merge of feature branches (per 0008). v0 sets the array to a single element; multi-parent semantics are not defined for v0 readers, who MUST treat such SIs as having only their first parent as the chain (forward-compatible with future merge support).
4. The Spatial-Key Derivation Pipeline
For a writer hashing an item or a reader hashing a query:
The pre-normalization step is metric-dependent: cosine metric requires unit-vector input; L2 metric does not. The algorithm-id implies the metric.
5. v0 Default Algorithm: dreamdb.lsh-cosine
dreamdb.lsh-cosine is the DreamDB v0 default and the only algorithm v0 implementations are required to support.
5.1 Mathematical definition
Given:
- An input vector
v ∈ R^D, L2-normalized so||v|| = 1. - N hyperplanes, each a unit vector
h_i ∈ R^D, sampled i.i.d. from the uniform distribution on the unit sphere.
For each i ∈ {0, 1, ..., N-1}:
(Convention: ties at <v, h_i> = 0 resolve to 1, making the function a total function on R^D.)
The output bit string is b_0 b_1 ... b_{N-1}, encoded as a base2 string in big-endian bit order — b_0 is the leftmost character of the spatial key.
5.2 Locality property (Charikar 2002)
For two unit vectors v_a, v_b with angle θ ∈ [0, π]:
For independent hyperplanes, the joint probability that all N bits agree is (1 - θ/π)^N. The expected prefix length for two queries v_a, v_b with angle θ is approximately 1 / (θ/π) bits (geometric distribution of the position of the first differing bit).
Practical implication for billion-scale recall (see §8 for a sizing table):
- At
θ = 5°(highly similar pairs):P[bit-collision] ≈ 0.972. Expected prefix length ≈ 36 bits. - At
θ = 10°:P ≈ 0.944. Expected prefix length ≈ 18 bits. - At
θ = 30°:P ≈ 0.833. Expected prefix length ≈ 6 bits.
5.3 Hyperplane generation from a seed
The params sub-Object is:
Hyperplanes are not stored directly — they are regenerated deterministically from the seed by every implementation. Storing 18 hyperplanes of dim 768 would be ~55 KB (small but unnecessary); storing the seed is ~32 bytes.
The deterministic generation procedure is fixed:
- Initialize a ChaCha20 stream cipher with the 32-byte seed as key, all-zero 12-byte nonce, counter 0. (ChaCha20 chosen because it is fast, has a clear specification, and is widely available across language runtimes — Rust's
chacha20, Go'scrypto/chacha20, Python'scryptography.) - For each hyperplane index
i ∈ {0, ..., N-1}: a. GenerateD × 4bytes of keystream from ChaCha20 (each hyperplane consumesD × 4bytes regardless of byte order; deterministic continuation). b. Interpret each consecutive 4 bytes as a little-endian signed 32-bit integern, then convert to a bounded f32 vian_as_f32 = (n as f32) / 2147483648.0(= 2³¹). This gives each element a finite f32 in approximately[-1.0, 1.0]. The vectorg_i ∈ R^Dis theDsuch values. Rationale: interpreting raw bytes as IEEE 754binary32directly admits NaN, infinity, and astronomically-large finite values that overflow the L2-norm sum-of-squares for typicalD. The integer-then-scale path produces bounded, always-finite f32 values whose distribution is roughly uniform on the unit sphere after normalization — sufficient for the cosine LSH locality property without requiring a Box-Muller transform (which would add f32 precision concerns). c. Computeh_i = g_i / ||g_i||(L2-normalize). If||g_i|| = 0(only possible ifDconsecutive 4-byte windows all happen to be0x00000000, which ChaCha20 doesn't produce in practice), advance the keystream byD × 4more bytes and retry.
The binary32 distribution from ChaCha20 keystream is not literally Gaussian, but the L2-normalization step projects to the unit sphere, which is what locality requires. (The strict-Gaussian distribution would require a Box-Muller transform, which adds f32 precision concerns; the projected uniform-on-sphere is locality-equivalent for cosine-metric use.)
Implementations MUST use this exact procedure. Conformance test vectors (0009) include known-seed → known-hyperplanes round-trips.
5.4 Floating-point determinism
LSH-cosine involves f32 arithmetic. Implementations MUST treat this carefully:
- Vectors are encoded as
binary32(IEEE 754 single-precision, little-endian) per the modality tag (embedding.f32.*). - Inner products
<v, h_i>MUST be computed via a deterministic order — naïve left-fold over the dimensions, accumulating in f32. Higher-precision intermediate (f64) accumulators are FORBIDDEN, since they introduce platform-dependent rounding. - The ChaCha20 keystream is bit-deterministic; the byte-to-f32 conversion is bit-deterministic. The only platform-specific step is the L2-norm and the subsequent dot-product fold; the spec pins f32 left-fold to remove ambiguity.
This is a tighter constraint than 0003 §4.1's no-f64-on-time-axis rule, because here we deliberately use f32 for vector math (it's how embeddings are sized for storage). The discipline is: f32 in, f32 out, deterministic accumulation order, no f64 widening.
5.4.1 Compiler-flag discipline (mandatory)
f32 left-fold determinism is necessary but not sufficient on its own. Compilers and CPUs have several subtle re-orderings that can produce divergent bits even on the "same" left-fold instruction stream:
- Floating-point contraction (
a × b + cfused to a singlefmainstruction): produces a different result than separate multiply + add becausefmarounds once instead of twice. LLVM's contraction is enabled by default on some platforms. - Implicit FMA under
-O2/-O3: even without-ffast-math, some compilers fuse multiply-adds opportunistically. -ffast-mathand equivalents: enable algebraic re-association (a + b + c → c + b + a) that breaks left-fold semantics. MUST be disabled.- CPU microarchitecture variation: same instruction stream can produce different bits across Skylake AVX-2 vs. Zen-3 AVX-2 in extreme corners (rare but observable).
Conformant implementations MUST:
- Disable floating-point contraction on every f32 path in the spatial-index pipeline. In Rust:
#[no_implicit_fma](when stabilized) or compile with-C llvm-args=-fp-contract=off. In C/C++:#pragma STDC FP_CONTRACT OFFor-ffp-contract=off. In Python: use a soft-float library; never trust NumPy's f32 path for canonical reference. - Forbid
-ffast-math,--unsafe-fp-math, and equivalents on any code path producing a DreamDB spatial key. - Implement the canonical scalar path explicitly — no SIMD intrinsics, no auto-vectorization. This path is the bit-exact reference per
0009§5.3.1. - Test against the canonical path on every supported architecture (x86-64 AVX-512/AVX2/scalar; ARM64 NEON/scalar) — see
0009§5.3.1 for the conformance battery.
For the canonical scalar reference: implementations SHOULD use a soft-float library (e.g. softfloat-rs, MPFR with binary32 precision) to produce vectors that are bit-exact regardless of host CPU. Hardware-f32 paths verify against the soft-float reference.
Without this discipline, two implementations on different hosts will produce different bits for some vectors and silently route them to different Spatial Buckets. The class of bugs is exactly what content-addressing's strong-consistency story exists to prevent — but only if implementations honor it on the f32 side.
Conformance test vectors include cases sensitive to f32 rounding-order to verify implementations match.
5.5 Why LSH-cosine for v0 and not something stronger
LSH-cosine is not the highest-recall spatial-index algorithm. PQ-IVF (FAISS-style) and learned hashing both achieve significantly better recall per byte fetched at billion-scale. Why is dreamdb.lsh-cosine the v0 default?
- No training data required. A writer publishing the first track on a Timeline cannot run k-means over 1B vectors that don't yet exist. LSH parameters are seeded; no fitting step.
- Algorithmic clarity. Charikar's locality bound is closed-form. Sizing decisions can be made from a calculator, not from running benchmarks.
- Implementation simplicity. A conformant LSH-cosine implementation is ~200 lines of Rust. PQ-IVF requires a quantizer, a residual-quantizer training pipeline, an asymmetric distance computation, and a much larger params blob.
- Forward compatibility. The algorithm registry in §3.4 makes PQ-IVF, learned hashing, etc. plug-in additions to the spec without breaking the address grammar or backend contract. Future spec versions ship them under
dreamdb.pq-ivf,dreamdb.learned-mlp-v1, etc.
The billion-scale benchmark with dreamdb.lsh-cosine will not match FAISS-IVF-PQ on recall — but it will match within a factor of 2-3× and is the right choice to ship a clean, reference-quality v0. Production deployments wanting higher recall can use a v0.1+ algorithm without changing anything about the data model or address grammar.
5.6 Extension Algorithm: dreamdb.ivf-cosine
dreamdb.ivf-cosine is an OPTIONAL extension that partitions the vector space via k-means centroids rather than random hyperplanes. It is the highest-recall algorithm v0 implementations are documented to support and is RECOMMENDED for deployments with real-embedding data (where natural cluster structure exists).
On SIFT-1M with k=1024, dreamdb.ivf-cosine at nprobe=32 reaches recall@10 = 0.97 versus dreamdb.lsh-cosine's 0.88 at the same probe budget — a 9-percentage-point lift, matching the faiss IndexIVFFlat reference within 0.001.
5.6.1 Params (CBOR)
The SpatialIndex Object's outer bits field MUST equal ceil(log2(k)) — this is the spatial-key width needed to encode every centroid id. Decoders reject mismatches with BitsTooNarrow.
Centroids on the wire are not required to be unit-norm; decoders MUST L2-renormalize on load (defense-in-depth — a malformed producer cannot break the cosine semantics for downstream readers).
5.6.2 Mathematical definition
Given:
- An input vector
v ∈ R^D, L2-normalized so||v|| = 1. - A trained centroid set
C = {c_0, c_1, …, c_{k-1}}, each L2-renormalized to unit length on load.
The spatial key is the id of the nearest centroid by cosine similarity:
For unit vectors, argmax dot product equals argmin angular distance equals argmin L2 distance — the algorithm is exact-equivalent to nearest-centroid in any of those metrics.
5.6.3 Multi-probe
Multi-probe IVF returns the top nprobe centroids by descending dot product (not Hamming-distance flips, which are an LSH-specific concept). The max_hamming_distance parameter accepted by the verb layer is documented as ignored by dreamdb.ivf-cosine implementations — accepted for API parity with dreamdb.lsh-cosine only.
Probe ordering: descending by <v, c_i>, ties broken by ascending id. Returns are deduplicated by spatial key (which can't collide given each centroid maps to a distinct id).
5.6.4 Floating-point determinism
dreamdb.ivf-cosine implementations MUST honor the same f32 discipline as dreamdb.lsh-cosine (§5.4):
- Left-fold accumulator for dot products.
- No FMA / fp-contract.
- Bit-identical scalar reference per architecture; SIMD paths MUST verify against the scalar reference on every supported architecture.
The hot path is k × D multiplies per query (versus bits × D for LSH) — meaningfully more arithmetic, but identical in its determinism requirements.
5.6.5 Training and the centroid-publication contract
Unlike dreamdb.lsh-cosine (where hyperplanes derive from a seed and need no training data), dreamdb.ivf-cosine requires a centroid set. This raises a coordination problem for multi-writer timelines: writers MUST agree on the centroid set, or their writes go into incompatible spatial keys.
The protocol resolves this via SpatialIndex-Object immutability:
- One writer trains and publishes the
SpatialIndexObject(Genesis-time or first-write). The SI Object's address — its content hash — is the algorithm's identity on this timeline. - Subsequent writers fetch and consume. Reading the Manifest yields the
spatial_index_hash; fetching that Object yields the centroids; the local algorithm instance is then trivially consistent with the publishing writer's hashes. - No re-training mid-Track. A second-generation centroid set (e.g. trained after data drift) MUST go into a new SpatialIndex Object referenced from a new Track — see
spec/0008on Track layering. Mutating an existing SI Object is FORBIDDEN.
For the training step, implementations are RECOMMENDED to use spherical k-means with k-means++ initialization, with the following determinism guarantees:
- A canonical sample-selection rule (the obvious one: the first
pca_sample_sizevectors of the dataset) plus a fixed RNG seed plus fixed iteration count MUST produce byte-identical centroids across implementations and architectures. - The scalar reference is the bit-exact source of truth; vectorized training paths MUST verify against it.
This deterministic-training contract is what makes "two writers train independently and arrive at the same SI Object" tractable. Multi-writer ceremonies that don't want to rely on this can instead designate one bootstrapper and have subsequent writers fetch — the protocol allows either pattern.
5.6.6 Sizing guidance
- k:
O(sqrt(N))is the standard k-means rule of thumb.k=1024forN=1M; scale up tok=32KforN=1B. - nprobe: trade recall for cost.
nprobe=k/32≈ 0.95 recall@10 on typical embedding data;nprobe=k/16≈ 0.97;nprobe=k/8≈ 0.99. - sample size for training: 100K vectors covers
dim ≤ 1024adequately. More samples don't materially change centroids on real-embedding data (the population already plateaus the covariance). - iterations: 20 Lloyd iterations suffice for real-embedding data; SIFT-1M centroids stabilize by iteration 10.
5.7 Extension Algorithm: dreamdb.imi-cosine
dreamdb.imi-cosine (Inverted Multi-Index, Babenko & Lempitsky 2012) is the algorithm of choice when the effective cell count k_total would otherwise exceed ~10⁵. It maps onto the same DreamDB primitives as dreamdb.ivf-cosine (one Spatial Bucket per cell, content-addressed VS Object for vector data) but factors the partitioning into two independent half-codebooks of size k_sub, giving k_sub² effective cells.
The structural problem IMI solves: at billion-scale with linear-k sizing (one cell per ~1000 vectors), pure IVF needs k_total ≈ 10⁶ centroids. Per-query argmax over 10⁶ centroids is ~30 ms scalar — it BECOMES the dominant query latency, breaking the scale-independent-latency promise. IMI's cost reduces to O(2·k_sub) = O(2·√k_total): the same 10⁶-effective-cell index needs only 2 × 1000 = 2000 centroid distances per query, on the order of 60 µs.
The same factorization applies to per-vector ingest hashing (the dominant cost during write). At billion scale, the O(N) ingest improvement compounds into days saved.
5.7.1 Params (CBOR)
The SpatialIndex Object's outer bits field MUST equal 2 × ceil(log2(k_sub)) — the spatial-key width needed to encode the combined (id_high, id_low) cell id. The outer dim field MUST be even (the half-split convention requires it).
5.7.2 Mathematical definition
Given:
- An input vector
v ∈ R^D, L2-normalized so||v|| = 1. - Two half-codebooks
C_high = {c_h_0, …, c_h_{k_sub-1}}andC_low = {c_l_0, …, c_l_{k_sub-1}}, each unit-renormalized on load.
Split v into v_high = v[0..D/2] and v_low = v[D/2..D]. Note that v_high and v_low are NOT individually unit-norm — they're literal slices of the unit-normalized full vector. This matters: the asymmetric distance computation
holds exactly because dot products distribute over concatenation. The cell selected is
5.7.3 Multi-probe via the multi-sequence algorithm
Naïve "enumerate all k_sub² cells and sort" is too expensive at large k_sub (e.g. 10⁶ enumerations at k_sub = 1000). The standard solution (Babenko & Lempitsky §4) is the multi-sequence algorithm:
- Sort the per-half dot products independently —
k_sub log k_subcost per half. - Maintain a min-heap of grid coordinates
(h, l)ordered bydots_high_sorted[h] + dots_low_sorted[l]descending. - Pop the top, emit its cell, then push
(h+1, l)and(h, l+1)if not yet visited. - Repeat
nprobetimes.
This yields the top-nprobe cells in O(nprobe · log nprobe) time without enumerating all k_sub² pairs. The reference implementation in dreamdb-protocol/src/imi_cosine.rs follows this scheme exactly.
5.7.4 Floating-point determinism
Identical to dreamdb.lsh-cosine and dreamdb.ivf-cosine (§5.4): left-fold dot product, no FMA, scalar reference path. SIMD paths MUST verify against the scalar reference per-architecture. Tiebreak rules: argmax breaks ties to the smaller centroid id; the multi-sequence heap breaks score ties by ascending (h_grid, l_grid). Both are bit-deterministic.
5.7.5 Training and the centroid-publication contract
Same as dreamdb.ivf-cosine (§5.6.5) — a single writer trains, publishes the SI Object, others fetch and consume. The training procedure is two independent spherical-k-means runs, one per half:
- Sample N vectors and split each into half-vectors (
v_high,v_low). - Run k-means independently on the two halves with derived seeds (
seed,seed ^ HALF_MIX) so the two passes don't share RNG state. - Each half-pass is byte-deterministic given fixed inputs (sample, seed, iteration count) — multi-writer reproducibility is preserved.
5.7.6 When IMI vs IVF
The decision is governed by per-query and per-vector hash cost, not recall (both algorithms hit similar recall on real-embedding data at matched effective cell counts):
| Cell budget | Choice |
|---|---|
| k_total ≤ 10⁴ (≤ ~10M vectors at 1000 vectors/cell) | dreamdb.ivf-cosine — simpler, well-understood; centroid-scan cost is negligible |
| 10⁴ < k_total < 10⁵ (~10M-100M) | Either works; IMI starts winning when per-query latency budgets get tight |
| k_total ≥ 10⁵ (~100M+) | dreamdb.imi-cosine — IVF's per-query hash cost dominates the total budget |
There is no recall reason to prefer IMI at small k. The centroid factorization introduces a small (~1-3 percentage point) recall cost vs full IVF at matched k_total because half-codebooks can't model joint correlations between the two halves of the input. For deployments where every recall point matters, stick with IVF as long as the cell budget is small enough to make centroid-scan latency acceptable.
6. Recall-Widening Strategies
Single-table LSH-cosine has limited recall at meaningful angular separations: at θ = 30° (medium similarity), (1 − 30/180)^18 ≈ 4% — single-bucket lookups miss 96% of these candidates. v0 supports four independent recall-widening levers that compose; modalities choose which to enable via parameters.
Lever priority — correction 2026-05 (added after empirical sweep at
dreamdb-bench/reports/sweep-*.json): the 2026 spec elevated read-time multi-probe (Lever 4, §6.4) from "implementation-defined query-side technique" to a first-class lever. Empirically, at fixed per-query work (~16 cells fetched),tables=1 + max_hamming=2 + probe_count=16strictly Pareto-dominatestables=4 + max_hamming=1 + probe_count=4on recall, latency, AND storage. The new default ordering is: tunemax_hammingandprobe_countfirst; reach fortables>1only when read-time probing is insufficient.
6.1 Lever 1: Prefix truncation (always available)
The SDK truncates the query's spatial key to M < N bits and issues a list-prefix (or, on the hot path, scans the cached object_index for matching prefixes — see §7.3):
returns all bucket Objects whose spatial keys begin with the M-char prefix — up to 2^(N-M) buckets. Fetch all of them, run exact comparison locally on the union.
Recall as a function of M for a candidate at angle θ: (1 − θ/π)^M.
M | Buckets fetched | Recall at θ=10° | Recall at θ=30° |
|---|---|---|---|
| 18 | 1 | 35% | 4% |
| 14 | 16 | 49% | 8% |
| 10 | 256 | 67% | 17% |
| 6 | 4 K | 85% | 35% |
| 0 | 262 K | 100% (scan) | 100% (scan) |
Prefix truncation alone is insufficient for production recall at θ ≥ 20°. It is the right lever for high-similarity (θ < 10°) lookups and as a fallback for the other levers.
6.2 Lever 2: Multi-table LSH (tables=L modality parameter)
A modality declares tables=L (e.g. embedding.f32.dim=768.bucketed.spatial-bits=18.tables=4) to opt into multi-table LSH. Writers hash each Item with L independent SpatialIndex Objects (different seeds) and place L bucket-index entries per Item. Readers query each table at the same M and union the candidate sets.
Recall improves as 1 − (1 − p)^L where p is single-table recall. At θ=30°, M=14, p≈8%:
L | Storage multiplier | Recall at θ=30°, M=14 |
|---|---|---|
| 1 | 1× | 8% |
| 4 | ~1× index, 1× data | 28% |
| 8 | ~1× index, 1× data | 49% |
| 16 | ~1× index, 1× data | 75% |
Crucial point on storage cost: vector bytes are content-addressed and stored once. Bucket Objects in multi-table mode use byte-range references (per 0007's decoupled storage layout; OQ-23) rather than inlining the vector bytes — so additional tables only multiply the index entries (~50 B per entry), not the underlying ~3 KB vector bytes.
For a 1B-vector track with dim=768 and L=4:
- Vector byte storage: 1B × 3 KB = 3 TB (single-copy)
- Bucket index entries: 1B × 4 × ~50 B = ~200 GB (4× the single-table index)
- Total ~3.2 TB vs. 3.0 TB single-table — 7% storage premium for ~3.5× recall improvement.
The reader's L SpatialIndex Objects are listed in the modality's registry entry as an array. Each table is a separate tables_id ∈ {0, …, L-1} dimension in the address grammar (per 0002 §7.3.1 spatial-bucket index entries).
6.2.1 Storage caveat: the 7% premium is the un-compacted figure
The "1B × 4 × 50 B = ~200 GB" calculation above describes the storage at ingest time, when reference-mode buckets carry references into a single shared per-Track Vector-Storage Object (per 0007 §6.2). It does NOT describe the storage after compaction (per 0006 §7.3).
Compaction rewrites each (table_id, spatial_key) cell into its own per-cell Vector-Storage Object so query bytes are contiguous. With L>1, each vector is in L cells (one per table) and therefore lands in L separate per-cell VS Objects after compaction → post-compact storage is L× the raw vector bytes, eliminating the dedup advantage that motivated reference mode in the first place.
Empirically (dreamdb-bench/reports/realistic-postcompact-final.json, 20M dim=192 tables=4): 19.4 GB ingest-time storage → 84.8 GB post-compact storage (4.4× growth, of which 4× is the inevitable per-table replication and the rest is metadata churn).
Implications:
- Reference mode at
L=1after compaction: storage stays at 1× raw + small refs. Cheapest production layout for latency-sensitive workloads. - Reference mode at
L>1after compaction: storage is L× raw — equivalent to inline mode. The "75% storage savings" headline is recovered ONLY if you skip compaction (and accept slow scattered-read queries). - Inline mode at
L>1: same L× storage, simpler operations (no compaction, no writer/compactor coordination).
Updated guidance, 2026-05: at
L>1you now choose between (a) inline mode + always-L× storage + simple operations, or (b) reference mode + L× storage post-compact + writer/compactor coordination cost (per0008§7.4). The "free dedup" framing of the 2026-04 draft was wrong for latency-sensitive ANN — surface to operators as such.
6.2.2 When multi-table is still warranted
Multi-table earns its cost only when read-time multi-probe alone (§6.4) is insufficient to reach the target recall. Sweep results at 2M dim=128 bits=14 (dreamdb-bench/reports/sweep-*.json):
| Config (per-query cells fixed at ~16) | recall@1 | latency p50 | storage |
|---|---|---|---|
tables=4, max_hamming=1, probe=4 | 0.60 | 4 ms | 4× |
tables=8, max_hamming=1, probe=2 | 0.65 | 6 ms | 8× |
tables=1, max_hamming=2, probe=16 | 0.70 | 1 ms | 1× |
At equivalent per-query work, single-table with max_hamming=2 strictly dominates multi-table on recall, latency, and storage. Multi-table becomes attractive only at very high recall targets (recall@10 > 0.95) where read-time probing saturates.
6.3 Lever 3: Write-time multi-probe (replicate-probes=k modality parameter)
A modality declares replicate-probes=k (e.g. ... .replicate-probes=4) to opt into write-time replication. At write time, the writer computes the Item's natural spatial key plus k bit-flip neighbors (typically the k bits with smallest |<v, h_i>| — the bits closest to the decision boundary, most likely to flip under perturbation). The Item's bucket-index entry is replicated across all k+1 spatial keys.
At read time, the SDK fetches only the natural-key bucket; the candidate set is automatically widened because each bucket already contains nearby boundary-case vectors from neighboring buckets.
Storage cost: k+1 bucket-index entries per Item (still ~50 B each via byte-range references); same ~7% premium per probe at billion-scale. Recall at fixed read cost approaches that of multi-probe READ-time LSH (which is a query-side technique outside the protocol).
replicate-probes=k and tables=L compose: ... .tables=4.replicate-probes=2 gives 4 × 3 = 12 index entries per Item, very high recall.
6.4 Lever 4: Read-time multi-probe (probe_count and max_hamming query parameters)
New first-class lever, 2026-05. The 2026-04 draft mentioned read-time multi-probe in §6.3 as a query-side technique "outside the protocol." Empirical evidence has elevated it to a load-bearing lever — see §6 preamble and §6.2.1.
Read-time multi-probe widens the candidate cell set at QUERY TIME instead of write time. The query verb (per 0006 §6.6) takes two parameters:
probe_count— the maximum number of cells the query inspects per table. Cap is the candidate-pool size at the chosenmax_hamming.max_hamming— the radius (in bits) within which to consider candidate cells.
Given the query's primary spatial key (the deterministic LSH output for the query vector), the candidate pool is:
max_hamming | Candidate count at bits=N | Behavior |
|---|---|---|
| 1 | 1 + N | Primary cell + each single-bit flip |
| 2 | 1 + N + N(N−1)/2 | Above + each pair-bit flip |
Candidates are ranked by ascending sum of |projection magnitudes| for the flipped bits — the bits closest to the LSH decision boundary, where the query's classification was least confident. Multi-probe probe_count=K fetches the top K of this ranked pool.
This is a strictly stronger ordering than write-time multi-probe (§6.3): write-time replication picks bit-flip candidates for each Item (independent of any future query), whereas read-time multi-probe picks them for the specific query vector using its actual projection magnitudes. The latter has more information to work with, so per-cell hits are more useful.
6.4.1 Why max_hamming is the underused dial
At bits=N=14, max_hamming=1 caps the candidate pool at 15 cells. Setting probe_count > 15 with max_hamming=1 is a no-op — the cap binds. Operators who set probe_count=64 thinking "more probes = more recall" but leave max_hamming=1 get exactly the same result as probe_count=15. This footgun is silent.
At max_hamming=2, the candidate pool grows to 15 + C(14, 2) = 106 cells. probe_count=16 now selects the 16 best-ranked candidates from a pool of 106 — far more useful than from a pool of 15. Empirically this lifts recall@1 from 0.40 to 0.70 at zero storage cost (dreamdb-bench/reports/sweep-A.json vs sweep-C.json).
Recommended defaults:
max_hamming = 2(universal default — minor compute cost, large recall lift)probe_count = 16-32(tune against target recall)tables = 1unless §6.2.2 sweep data shows tables > 1 is required for the recall target
max_hamming = 3 is permitted but produces a candidate pool of 1 + N + N(N−1)/2 + N(N−1)(N−2)/6 (e.g. 470 cells at N=14) — typically more cells than the SDK can afford to fetch, and the marginal recall gain is small. SDKs MAY support it for high-recall workloads.
6.5 SDK-side recall-widening procedure (combined)
A query for k-nearest-neighbors with target recall ρ on a modality with tables=L and replicate-probes=k, using read-time multi-probe (probe_count, max_hamming):
- Look up the modality's registry → get the
LSpatialIndex Object hashes. Fetch and cache the SpatialIndex Objects (Lsmall GETs, once per session). - Compute the query's spatial key under each of the
LSpatialIndex functions →Lprimary query-keys. - Expand each primary key into
probe_countcandidates withinmax_hammingbits (Lever 4) — ranked by ascending |projection magnitude| for flipped bits. - Estimate angular threshold
θ_maxfor the target recall. - Compute the smallest
M(prefix truncation, Lever 1) such that combined recall ≥ ρ givenL,k,probe_count,max_hamming,θ_max. (Closed-form per §6.2; SDKs may also pre-tabulate.) - For each of the
L × probe_countcandidate keys: locate matching buckets via the cachedobject_index(hot path) orlist-prefix(cold start); collect bucket addresses. - Issue parallel ranged GETs against the union of bucket addresses.
- Exact-compare query against all candidates; return top-k.
The SDK MAY iteratively widen M, probe_count, or max_hamming if the result set is too small (adaptive recall widening); the policy is implementation-defined. Recommended escalation order: probe_count → max_hamming → M → L. (Cheapest to most expensive.)
7. Worked Example: 768-dim, 18-bit, 1B Vectors
The benchmark target. End-to-end mechanics:
7.1 Setup
Modality: embedding.f32.dim=768.bucketed.spatial-bits=18.
SpatialIndex Object (CBOR):
Object size: ~80 bytes. Address: spatial-index/<33-byte-multihash>.
The Manifest's registry declares the modality → SpatialIndex Object mapping. Readers fetch this once per session, cached.
7.2 Ingestion
For each input vector v ∈ R^768:
- L2-normalize
vto unit length. - Generate the 18 hyperplanes from the seed (computed once at module load, cached).
- Compute
b_i = sign(<v, h_i>)fori ∈ {0,...,17}→ an 18-bit string. - Encode the bit string as 18 base2 chars → the spatial-key segment.
- Append
vto the Spatial Bucket Object at<timeline>/<modality>/<spatial-key>/<bucket-hash>. (The bucket hash is determined when the bucket is finalized — per0002§7.4.2.)
At 1B vectors with N=18 (~262K buckets), each bucket holds ~3,800 vectors on average → ~12 MB per bucket Object at 3 KB/vector. Sweet spot for backend GET cost.
7.3 Query — hot path vs. cold start
For a query vector q seeking the 10 nearest neighbors at recall ≥ 90%, the SDK's path differs sharply between hot path (Track Object cached) and cold start (Track Object not yet fetched). The protocol-level behavior for production is the hot path; cold-start is the bootstrap mechanism.
Hot path — Track Object's object_index already cached from a prior query:
- Look up modality → SpatialIndex Object hashes via cached Manifest registry. (No backend call.)
- L2-normalize
q. Computeq's spatial keys for each of theLtables (cached hyperplanes). - Locate matching buckets in the cached
object_index(in-memory prefix scan, microseconds). For paged-form indexes (per0002§7.3.2), traverse the cached B-tree. - Issue parallel ranged GETs against the matching bucket addresses, multiplexed over an HTTP/2 (or HTTP/3) connection (per
0005). Nolist-prefixbackend call required — bucket addresses are known from the cached index. - Decode each bucket's packed-array vectors (per
0007), exact-compare againstq, accumulate top-10. - Return the top-10 with byte-range URIs (
0002§6.5).
Cold start — Track Object not yet cached:
- Locate the Track Object via the Manifest's
tracksfield; fetch it (one GET; for paged-form Track Objects, fetch the root Index Page and the relevant subtree — typically 2–4 small GETs). - Cache the
object_indexfor the session. - Proceed as the hot path from step 2.
Latency budget at billion-scale (single-table tables=1, N=18, hot path, prefix-truncated to M=14 → ~16 buckets):
- Steps 1–3: ~1 ms (purely local, in-memory).
- Step 4: 16 parallel HTTP/2 GETs of ~12 MB each. At commodity 10 Gbps with HTTP/2 multiplexing, total wall-clock ~50–100 ms.
- Step 5: local exact-KNN over ~60K candidates at dim=768 → ~10–30 ms (CPU-bound).
- Total p50 hot-path latency: ~100–150 ms. Within reach of the sub-100 ms benchmark target with PQ compression in
0007(which shrinks bucket bytes by ~30×) or with smaller buckets atN=22.
For multi-table (tables=4), step 4 fans out to 4× more buckets but each table's recall lever is independent: M can be larger per table for the same union recall, partially offsetting the GET-count increase.
The cold-start path adds 1 round trip (~30–80 ms) for the Track Object fetch. This is paid once per Track per session — irrelevant at production query rates, where the cache is always warm. The list-prefix backend call appears nowhere on the hot path; it is only the cold-start bootstrap mechanism for SDKs without any cached Track Object at all.
8. Billion-Scale Considerations
8.1 The 100 ms target requires smaller buckets
§7.3's example fetches ~192 MB across 16 buckets to hit 90% recall at θ_max = 10°. That's far over the 100 ms p95 target.
Two levers:
- Smaller buckets. Increasing
Nfrom 18 to 22 → 4M buckets, ~250 vectors each → ~750 KB per bucket. With M=8 prefix-truncation (16 buckets), total bytes fetched ~12 MB, achievable in ~50 ms warm cache. But more buckets means more backend Objects to manage. - Compressed vectors. Storing PQ-compressed vectors inside Bucket Objects (not full f32) — e.g. 8 sub-quantizers of 256 centroids each, 8 bytes per vector instead of 3 KB. 1B vectors → 8 GB total. Each bucket: ~250 × 8 bytes = 2 KB. Recall is then a re-rank step: fetch candidates' approximate distances, identify top-K candidates, fetch their full f32 vectors via byte-range URIs for exact distance.
The second lever is what production billion-scale retrieval actually does (FAISS-IVF-PQ, ScaNN). v0's dreamdb.lsh-cosine does not include PQ compression. A v0.1 algorithm dreamdb.lsh-pq that combines LSH bucketing with PQ compression inside buckets would pin this down.
8.2 Spatial-bits sizing
N = spatial-bits controls bucket count and per-bucket density. The right N depends on item count and target recall:
| Item count | Recommended N | Buckets | Items/bucket avg |
|---|---|---|---|
| 1 M | 10–12 | ~1K–4K | ~250–1000 |
| 10 M | 14–16 | ~16K–64K | ~150–600 |
| 100 M | 16–18 | ~64K–256K | ~400–1500 |
| 1 B | 18–22 | ~256K–4M | ~250–4000 |
Rule of thumb: target ~250–4000 items per bucket. Below 250, prefix truncation has too few buckets to widen across; above 4000, exact KNN within a bucket dominates query cost.
8.3 SpatialIndex Object caching
Each SDK session loads the SpatialIndex Object (small) and the regenerated hyperplanes (~55 KB at dim=768, N=18). One-time cost per modality per session — irrelevant at billion-scale.
8.4 Hyperplane regeneration cost
ChaCha20 → 768 × 4 bytes per hyperplane × 18 hyperplanes = ~55 KB of keystream. ChaCha20 throughput is ~1 GB/s on commodity hardware → regeneration takes ~50 µs. Cached after first use.
9. Out of Scope for this Document
- Per-modality vector encoding beyond what the modality tag specifies.
embedding.f32.dim=768says "f32, 768-dim"; how the application produced the f32 values (which embedding model, whose normalization convention) is a DreamDB non-goal per0000§8. - Inside-bucket layout (packed-array headers, vector compression formats) —
0007. - Query routing across multiple Tracks of the same modality (e.g. the same embedding model layered over many videos) — that's a
0006query-operation concern. - Algorithm fitness for non-cosine metrics.
dreamdb.lsh-cosineis cosine-only; L2 metric needs a different algorithm. Future v0.1 spec addsdreamdb.lsh-l2(random projection LSH with bucket-width parameter). - Multi-table LSH. Mentioned in §6.3 as a future extension; v0 ships single-table only.
10. Open Questions Surfaced by This Document
- OQ-23 (→ 0007 §6): Bucket-internal vector encoding and storage layout. Resolved: inline-mode (default for
tables=1, fixed-size records); reference-mode (default fortables>1, byte-range refs into per-Track Vector-Storage Objects). Pure f32 in v0; PQ-compressed deferred to v0.1 (OQ-38). - OQ-24 (→ 0007 §6.4): Bucket-splitting rule. Resolved: split when a Bucket Object would exceed
bucket-max-bytes(default 100 MiB); new Bucket has same spatial key, new content hash, new index entry per0002§7.3.1's allowance for duplicatespatial_key_i. - OQ-25 (→ v0.1 spec): A
dreamdb.pq-ivfalgorithm — closer to FAISS-IVF-PQ — for production-grade recall at billion scale. v0 ships LSH-cosine as default with multi-table (§6.2) for production-grade recall; PQ-IVF is the v0.1 addition that closes the recall gap further at the cost of a training step. - OQ-26 (→ v0.1 spec): A
dreamdb.lsh-l2algorithm for L2-metric modalities. v0 cosine-only. - OQ-27 (→ 0009 §5.3): Conformance test vectors for
dreamdb.lsh-cosine. Resolved: full battery in0009§5.3.1 — multi-architecture coverage (x86 AVX-512/AVX2/scalar, ARM NEON/scalar), known-seed → known-hyperplanes, f32-rounding-boundary cases. - OQ-28 (→ 0006 §6.4 / §6.5): Adaptive recall widening + speculative preloading. Resolved as implementation-defined: SDKs MAY widen prefix-truncation depth
Magainst cached index; SDKs MAY speculatively preload on cold-start paths but MUST cancel via RST_STREAM on consumer abandonment. - OQ-29 (→ 0005 §6): HTTP/2 connection requirement. Resolved: HTTP/2 mandatory for Connectors; HTTP/1.1 fallback only with logged warning; HTTP/3 permitted as upgrade.
Next: 0005-backend-interface.md — pins the HTTP semantic contract between the Storage Connector and the Cloud Object Store: which verbs, which status codes, which conditional headers, which consistency guarantees. Resolves OQ-3, OQ-22, and the ContentStore/RefStore tier definitions referenced throughout.