DreamDBv0.2.0bec026

Spec 0015 — Hybrid Retrieval and Query Planning

Status: Draft (Phase 4 design). Depends on: spec/0001, spec/0002, spec/0004, spec/0006, spec/0010, spec/0011, spec/0013. Motivation: spec/0004–0013 give DreamDB the per-modality retrieval primitives — vector ANN (LSH/IVF/IMI/Vamana), vector compression (QINCo), scalar filters (B-tree / bitmap), federation, graph routing. Production search engines (Elastic, Vespa, Pinecone, Weaviate, Marqo) all converged on something DreamDB does not yet have: a hybrid retrieval contract that fuses lexical (BM25), dense (vector), scalar (structured), and (increasingly) multi-vector late-interaction (ColBERT) into a single ranked list per query — with a query planner that decides which indexes to consult, in what order, with what cost budget. Without this layer, DreamDB is "a strong collection of indexes" rather than "a search engine." This spec closes that gap.


1. Purpose

Real-world retrieval queries are not pure: "DreamDB protocol changes since 2026" needs an exact-phrase match on "DreamDB," a semantic match on "protocol changes," and a time filter on "since 2026." Today each goes through a different verb on a different index; the application is responsible for fusing the results. That's correct as a separation of concerns but inadequate as a default — applications routinely get the fusion wrong (incompatible score scales, double-counted duplicates, blown latency budgets).

By the end of this document the following are concrete:

  • The TextIndex Object: a new ObjectKind that holds inverted-index posting lists for lexical retrieval. Algorithm registry adds dreamdb.bm25, dreamdb.bm25-plus, and OPTIONAL dreamdb.splade-cosine for learned sparse retrieval.
  • The MultiVectorIndex Object: ObjectKind that supports ColBERT-style late-interaction retrieval (per-token vectors with MaxSim aggregation).
  • The HybridQuery verb: extends spec/0006 §4.3's Query to accept multi-modality sub-queries with fusion policy.
  • The Reciprocal Rank Fusion (RRF) default: scale-invariant score combination; the v0.X production default.
  • The Query Planner: cost-model + execution strategy. Decides pre-filter-vs-post-filter for scalar+vector composition, chooses which TextIndex/SpatialIndex/GraphIndex to consult, batches I/O, applies early-termination.
  • The pre-filter vs post-filter contract: when a query has a scalar predicate AND a vector predicate, which runs first is a major optimization. spec/0015 pins the planner's contract so two implementations make the same choice given the same statistics.

What stays defined elsewhere:

  • Per-modality storage layouts — spec/0007, spec/0010, spec/0013.
  • Manifest registry, capability tokens — spec/0002, spec/0012.
  • The Query verb's general shape — spec/0006.

What this document does NOT define:

  • Question answering / generation. Retrieval ends with a ranked list; the application's LLM / answer generator is downstream.
  • Personalization / behavioral signals. Click-through reweighting is application-layer.
  • Cross-Track joins. Joins are out of scope; the application materializes them.
  • Approximate query languages (SQL-like, GraphQL-like). HybridQuery is structured-CBOR; pretty syntax is SDK ergonomics.

2. The retrieval surface

SurfaceWhat it answersExisting indexNew index (this spec)
DenseSemantic similarity over embeddingSpatialIndex (0004) / GraphIndex (0013)
Dense compressedSame, with PQ/QINCo codesVectorCompressor (0010)
Scalarlabel = "cat", time ∈ [a, b], score > 0.5ScalarIndex (0011)
LexicalBM25 / token-match for proper nouns / codesTextIndex (§3)
Multi-vectorColBERT MaxSim over per-token vectorsMultiVectorIndex (§4)
Hybrid (fusion)Combine 2+ of the aboveHybridQuery (§5)

Every individual surface remains its own ObjectKind with its own verbs. The new layer is the HybridQuery composition plus the new lexical and multi-vector ObjectKinds.

3. The TextIndex Object

A TextIndex Object holds an inverted index for a text-bearing modality.

3.1 Modality string

Text modalities follow the existing grammar:

text.utf8                                                  ;; bare text
text.utf8.tokenizer=bert-base.bm25                         ;; tokenized + bm25 index
text.utf8.tokenizer=bert-base.bm25-plus                    ;; bm25+
text.utf8.tokenizer=bert-base.splade=naver/splade-v2       ;; learned sparse

The .bm25 / .bm25-plus / .splade=… suffix is OPTIONAL. Without it, text is stored as Constant or Time-bucketed Discrete Event items (per existing modality table); presence enables index publication.

3.2 Address path

<timeline>/<modality>/text-index/<text-index-hash>

(New per-Timeline slot, parallel to spatial-index/ and graph-index/.)

3.3 CBOR encoding

{
  "algorithm":      "<algorithm-id>",            ;; "dreamdb.bm25", "dreamdb.bm25-plus", "dreamdb.splade-cosine"
  "tokenizer": {
    "id":           "<tokenizer-id>",            ;; "bert-base", "xlm-roberta", "dreamdb.utf8-words"
    "vocab_hash":   <multihash>,                 ;; content hash of the tokenizer vocab Object
  },
  "stats": {
    "doc_count":      <unsigned int>,            ;; N
    "total_tokens":   <unsigned int>,            ;; sum of doc lengths
    "avg_doc_len":    <f32-as-bytes>,            ;; for BM25 length normalization
  },
  "params":         <algorithm-specific>,        ;; §3.4
  "posting_root":   <multihash>,                 ;; root of paged posting-list B-tree
  "doc_norms": {                                 ;; optional; absent for BM25 (length is in stats)
    "norms_root":   <multihash>,                 ;; root of per-doc norm pages (for SPLADE)
  },
}

3.4 Algorithm-specific params

dreamdb.bm25 and dreamdb.bm25-plus

{
  "version":     1,
  "k1":          <f32-as-bytes>,                 ;; default 1.5
  "b":           <f32-as-bytes>,                 ;; default 0.75
  "delta":       <f32-as-bytes>,                 ;; bm25+ only; default 1.0
}

dreamdb.splade-cosine

{
  "version":          1,
  "encoder_hash":     <multihash>,               ;; the SPLADE encoder Object (a VectorCompressor-style weights blob)
  "max_active_dims":  <unsigned int>,            ;; sparsity ceiling per doc
  "metric":           "cosine",
}

SPLADE produces a sparse |V|-dim vector per doc; the index is effectively an inverted file keyed by vocab dimension. Same posting-list machinery as BM25, just different per-document weights.

3.5 Posting-list pages

Posting lists for high-volume tokens exceed inline budget. Following spec/0002's paged-index pattern:

;; Leaf posting page (per term)
{
  "term_id":       <unsigned int>,
  "page_index":    <unsigned int>,
  "entry_count":   <unsigned int>,
  "entries": [
    [<doc_id: u64>, <tf: u32>, <positions: bytes | null>],
    …
  ],
}

Address: <timeline>/<modality>/text-index/posting/<page-hash>.

tf is term frequency in the doc; positions is OPTIONAL byte-packed list of in-doc token positions for phrase queries. For BM25 without phrase support, omit positions to halve posting-list size.

3.6 BM25 scoring

For a query Q and document d:

BM25(Q, d) = Σ_{t in Q}  IDF(t) · ( tf(t, d) · (k1 + 1) ) /
                                   ( tf(t, d) + k1 · (1 - b + b · |d|/avgdl) )

IDF(t) = ln( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )

dreamdb.bm25-plus adds + delta to each per-term contribution (Yang & Lin 2011); recommended for long-doc collections.

Scoring is computed by the SDK after fetching the relevant posting-list pages. Implementations MUST use the same fp32 left-fold discipline as spec/0004 §5.4.1 to guarantee bit-identical scores across implementations (otherwise top-K rankings drift across SDKs for tied scores).

3.7 Registry reference

The Manifest registry's per-modality entry for an indexed text modality:

"registry": {
  "text.utf8.tokenizer=bert-base.bm25": {
    "kind":         "discrete",
    "object_kind":  "time-batch",
    "algorithm":    "dreamdb.bm25",
    "text_index":   <multihash-of-TextIndex-Object>,
  }
}

Standard SpatialIndex / ScalarIndex / VectorCompressor lineage validation (per 0004 §3.3.1, 0010 §3.3) applies symmetrically. SDKs MUST validate algorithm match and tokenizer-vocab-hash match before scoring.

4. The MultiVectorIndex Object

A MultiVectorIndex Object supports ColBERT-style retrieval: each Item is represented by N vectors (typically per-token contextualized embeddings), and similarity to a query (also N vectors) is the MaxSim aggregation:

MaxSim(Q, D) = Σ_i  max_j  <q_i, d_j>

Late interaction beats single-vector retrieval substantially on long documents (Khattab & Zaharia 2020, BEIR benchmark) at the cost of N× storage.

4.1 Modality string

embedding.f32.dim=128.multi-vec.max-tokens=256[.compress=qinco:M=4]

multi-vec is the marker; max-tokens is the per-doc token cap. Compression composes per spec/0010.

4.2 Storage layout

A MultiVectorIndex holds a single SpatialIndex (any spec/0004 or spec/0013 algorithm) over all token-vectors in the corpus. Lookup proceeds as:

  1. Compute the N query-token vectors.
  2. For each, do a coarse ANN over the global token index → top-K_tok candidates.
  3. Group results by parent doc-id (each token-vector record carries doc_id).
  4. For each candidate doc, fetch the doc's full token-vector set; compute exact MaxSim.
  5. Re-rank by MaxSim; return top-K_doc.

Bucket records carry an extra doc_id: u64 alongside the existing time_anchor: u64:

record_size = 8 (time_anchor) + 8 (doc_id) + code_bytes

Bucket header gains multi_vec_index_hash: 33 bytes (a MultiVectorIndex Object hash; analog of spatial_index_hash per spec/0007 §6.1.1). Header grows to 240 bytes when the field is present.

4.3 Address path

<timeline>/<modality>/multi-vec-index/<hash>

CBOR encoding mirrors a SpatialIndex Object with an additional parent_doc_field declaring which Track holds the doc_id → parent-Item resolution.

5. The HybridQuery verb

Extends spec/0006 §4.3 Query. A HybridQuery accepts a structured spec describing per-modality sub-queries and the fusion policy.

5.1 Query spec (CBOR)

{
  "version":       1,
  "track_selector": <TrackSelector sub-Object>,   ;; logical Track this query targets; see §5.1.1
  "sub_queries": [
    {
      "label":        "lex",                      ;; query-local identifier
      "modality":     "<modality-tag>",
      "kind":         "text" | "vector" | "scalar" | "multi-vec",
      "query":        <kind-specific sub-Object>,
      "k_local":      <unsigned int>,             ;; per-sub-query top-K (optional; default 100)
      "weight":       <f32-as-bytes>,             ;; for linear fusion (optional; ignored for RRF)
      "required":     <bool>,                     ;; if true, candidate MUST appear here to survive
    },
    …
  ],
  "fusion": {
    "policy":     "rrf" | "linear" | "max" | "pareto",
    "params":     <policy-specific>,              ;; e.g. RRF k constant
  },
  "k_final":      <unsigned int>,                 ;; final top-K to return
  "filters": [                                    ;; optional pre-filters; applied to whole query
    { "modality": "<scalar-modality>",
      "predicate": <predicate-CBOR> },
    …
  ],
  "budgets": {                                    ;; optional cost ceilings
    "max_latency_ms":   <unsigned int>,
    "max_bytes":        <unsigned int>,
    "max_compute_units": <unsigned int>,
  }
}

5.1.1 TrackSelector sub-Object

{
  "track_ref":          <multihash | string>,        ;; direct Track Object reference (legacy v0 shape)
  "logical_concept":    "<string> | null",            ;; OPTIONAL; per spec/0017 §4.1 multi-version selection
  "version_preference": "latest" | "all" | "<version-spec>" | null,   ;; OPTIONAL; per spec/0017 §4.1
}

A v0 query that targets a single Track passes only track_ref; the other two fields are absent or null. spec/0017 multi-version queries use logical_concept + version_preference (with track_ref = null), letting the planner enumerate matching versions from the Manifest registry rather than addressing a specific Track. The planner MUST accept both shapes — track_ref-only is the v0 fast path; logical_concept-only is the migration-aware path; both populated is malformed.

5.2 Fusion policies

Reciprocal Rank Fusion (RRF) — default

score_RRF(d) = Σ_i  1 / (k + rank_i(d))

where k = 60 by convention; rank_i(d) is d's rank in sub-query i (∞ if absent). RRF is scale-invariant — sub-queries with different score ranges (BM25 ~[0, 40]; cosine ~[0, 1]) combine sensibly without normalization. Cormack et al. 2009 + a decade of TREC track results show RRF is hard to beat without tuning.

Linear

score_linear(d) = Σ_i  weight_i · normalize_i(score_i(d))

Normalization is per-sub-query min-max over the candidate pool. Linear lets operators encode known preferences ("dense matters 2× more than lexical") but requires per-collection tuning. NOT the default.

Max

score_max(d) = max_i  normalize_i(score_i(d))

Cheap fallback; effectively "best signal wins." Useful for "lexical OR semantic" queries.

Pareto

Return the union of each sub-query's top-K_local; client takes responsibility for re-ranking. Bypasses score-combination entirely. Use when the application has its own re-ranker (e.g., a cross-encoder).

5.3 Required sub-queries

A sub-query with required: true becomes a filter — candidates absent from its top-K_local are eliminated from the final pool. This is how spec/0015 encodes "must match the lexical phrase AND be semantically close" without nested boolean syntax.

5.4 Latency-vs-recall budget

budgets.max_latency_ms lets the planner short-circuit. If projected total cost exceeds the budget, the planner reduces K_local in the cheapest sub-query first (typically the scalar filter), then the costliest (graph search at high L_search), and surfaces a degraded: true flag in the result.

6. The query planner

The planner translates a HybridQuery spec into an execution plan: an ordered list of index accesses with batched I/O and parallel-where-possible scheduling.

6.1 Pre-filter vs post-filter (the central decision)

When a query has a scalar filter (e.g., label = "cat") AND a vector sub-query, the planner chooses one of:

StrategyStepsBest when
Pre-filter1. Scalar lookup → candidate doc-id set; 2. Vector search RESTRICTED to setScalar is highly selective (<1%)
Post-filter1. Vector ANN → top-K_oversampled; 2. Filter by scalar; 3. Re-rankScalar is mildly selective (>1%)
Joint (future)A scalar-aware vector index (e.g., label_id participating in spec/0004 spatial-key derivation) gives O(1) intersectSingle field, very high selectivity. Deferred — spec/0004 v0 does not yet support hybrid spatial keys.

The threshold-of-1% is a planner heuristic, not a contract. The contract is:

  • The planner MUST estimate selectivity from the ScalarIndex Object (count of matching doc-ids vs total).
  • The planner MUST be deterministic given the same statistics — two implementations choose the same strategy for the same query. This makes performance reproducible.

6.2 Cost model

Each strategy is costed by:

cost = α · expected_GET_bytes + β · CPU_score_ops + γ · network_RTT_count

α, β, γ are SDK-level constants tuned per deployment (bandwidth, CPU speed, network latency). The constants are exposed as SDK config but the decision function consuming them is normative (per §6.1).

6.3 Plan stability (mandatory)

Per-query plans MUST be:

  • Deterministic given the same Manifest, ScalarIndex statistics, and HybridQuery spec.
  • Logged by the SDK at debug level: plan tree, estimated cost, actual cost. Operators rely on plan stability for capacity planning and regression detection.

6.4 Adaptive execution

Within a plan, the SDK MAY adapt:

  • If pre-filter returns 0 candidates → fall back to post-filter (the scalar predicate was wrong or stale).
  • If vector ANN returns 0 candidates above similarity floor → widen search-list / probe count once before giving up.
  • If an HTTP/2 stream stalls → time out and retry against a federated mirror (per spec/0012).

Adaptive execution is implementation-defined for the retry tactics but mandatory for the escalation order: cheaper-first widens first. Same discipline as spec/0004 §6.5 recall-widening escalation.

7. Hybrid query latency at 1B-scale

Worked example. Modality = text.utf8.tokenizer=bert-base.bm25 + embedding.f32.dim=768.bucketed.spatial-bits=18.compress=qinco:M=8. Corpus = 1B docs. Query = "dreamdb protocol" AND semantically similar to <query embedding> AND time > 2026-01-01.

Planner output (deterministic given stats):

1. ScalarIndex(time > 2026-01-01)  →  ~50M doc-id set  (5% selectivity; estimate)
2. TextIndex("dreamdb" ∧ "protocol")  →  ~10K doc-ids matching both terms
   (cheap; posting lists are tiny for these tokens)
3. Intersect (1) ∩ (2)  →  ~600 doc-ids
4. SpatialIndex (IMI+QINCo)  →  100 nearest doc-ids
5. Re-rank by RRF((2), (4))  →  top-10

Latency:

  • Step 1: 1 GET of ScalarIndex root + tree walk → ~20 ms warm.
  • Step 2: 2 GETs of posting-list pages → ~30 ms warm.
  • Step 3: in-memory set intersection → <1 ms.
  • Step 4: 16 GETs of Spatial Bucket Objects (QINCo-compressed; ~512 KB total) → ~30 ms warm.
  • Step 5: fusion + final ranking → <1 ms.
  • Total p50: ~80 ms warm. Single-backend; federation scatter-gather adds the spec/0012 §6 fan-out.

Cold-start adds ~100 ms (one Manifest fetch, one Track Object root fetch). Within the same budget as a single-modality query — fusion overhead is negligible.

8. Conformance categories (per spec/0009 §8.6.1)

CategoryPass criterionCoverage
hybrid.rrf.scale-invariance.*Same RRF score regardless of sub-query score scalesBM25+cosine; cosine+bm25-plus
hybrid.planner.deterministic.*Same Manifest + stats + query → same planMulti-implementation
hybrid.prefilter.threshold.*Selectivity <1% → pre-filter chosen; >1% → post-filterAdversarial stats
hybrid.required.elimination.*required:true sub-query eliminates non-matching candidatesBoolean-AND semantics
hybrid.bm25.fp32-determinism.*BM25 scores bit-identical across implementationsScalar reference vectors
hybrid.colbert.maxsim.*MaxSim aggregation matches reference (within fp32 left-fold discipline)dim=128 reference corpus
hybrid.splade.encoder-validation.*TextIndex with SPLADE algorithm validates encoder_hash against registryEncoder mismatch → critical error

9. Composition with existing specs

Existing primitiveComposition
spec/0004 (SpatialIndex)Dense sub-query backend; HybridQuery dispatches to existing Query verb
spec/0010 (VectorCompressor)All sub-queries inherit compression transparently
spec/0011 (ScalarIndex)Scalar sub-query AND scalar filter share the same B-tree / bitmap indexes
spec/0012 (Federation)HybridQuery scatter-gathers per-modality; merge is sub-query-aware
spec/0013 (Vamana)Dense sub-query backend (interchangeable with SpatialIndex)
spec/0014 (chunked Items)Sub-query results are Item IDs; stitching handled by per-Item read path

The composition story is clean by design: HybridQuery is a plan over existing primitives, not a new physical layer.

10. Out of scope

  • Cross-encoders / neural re-rankers. Late re-ranking with a transformer is application-layer; HybridQuery returns a ranked candidate set; the application's re-ranker downstream is its own concern.
  • Personalization signals. User-specific weights, click-through reweighting — out.
  • Query expansion (synonyms, hypernyms). Tokenizer-layer concern; the application can call its expander before submitting the HybridQuery.
  • Native graph queries. "Find docs cited by X" requires graph traversal of citation links; out of v0.X spec.

11. Open questions

  • OQ-62 (→ this spec): Should TextIndex carry full positional postings by default, or only term-frequency? Phrase queries need positions; recall-only retrieval doesn't. Defaulting to positional doubles index size. Recommendation: default OFF; opt-in via text.utf8.tokenizer=bert-base.bm25.positions=on.
  • OQ-63 (→ this spec): Planner cost-model constants (α, β, γ). Operator-tunable or SDK-fixed? Probably operator-tunable with sane defaults; documented in spec/0006 §3.X amendment.
  • OQ-64 (→ this spec): SPLADE vs ColBERT as the v0.X learned-retrieval default. Both work; SPLADE composes more cleanly with existing inverted-index machinery. Lean SPLADE; ColBERT as opt-in.
  • OQ-65 (→ spec/0009): Conformance test vectors for hybrid planning. Block v0.X release on a representative benchmark (BEIR or MS-MARCO slice).
  • OQ-66 (→ spec/0006): Add HybridQuery as a 10th verb or fold into Query? Resolved: spec/0006 §2.2.1 folds HybridQuery into the existing Query verb — the structured query_spec payload defined in §5.1 of this document is the new richer shape. The eight-verb taxonomy of spec/0006 §2 is unchanged.

Next: spec/0016 — streaming updates and real-time freshness (closes the continuous-ingest gap that hybrid retrieval surfaces — without sub-second freshness, "search what I just wrote" doesn't work).