Spec 0011 — Scalar Indexing (Native Structured-Metadata Filters)

Status: Draft (Phase 2 design). Depends on: spec/0001, spec/0002, spec/0004, spec/0007. Motivation: design/0001-dataset-platform.md — ML training pipelines need to filter samples by structured metadata (WHERE label='cat') alongside vector similarity and time range. DreamDB v0 has the latter two; scalar filtering is the missing piece. This spec defines the modality that closes the gap.

1. Purpose

The vector- and time-filter paths in v0 (spec/0004 + time-bucket addresses) both rely on a content-addressed index that lives alongside the bucket data: vector queries hit a SpatialIndex Object (LSH/IVF/IMI centroids); time queries descend the Track's paged time-bucket index.

For structured fields (label = 'cat', region IN ('us', 'eu'), confidence > 0.9, etc.) DreamDB has no analogous index today. The naïve fallback is full-scan over a Discrete Event Track, which works to ~10M rows and breaks beyond. This spec defines the index objects, Track-side bucket layout, and query semantics that make scalar filters scale-invariant in the same sense vector filters are: per-query work is bounded by the result-set size, not the corpus size.

2. Scope

In scope:

A new family of algorithm IDs under dreamdb.btree-* and dreamdb.bitmap-*.
A new InlineObjectIndex::ScalarBucket variant for Track Objects of scalar modalities.
A new bucket layout for the data side (sorted (value, sample-ref) pairs for B-tree; sample-id bitsets per value for bitmap).
Multi-version semantics: how the index handles overwrites in an append-only timeline.
Sizing guidance (when to pick bitmap vs B-tree, how to scale across N).

Out of scope (deferred to future revisions):

Full-text / token indexes. Documents under text.utf8 Tracks would need their own inverted-index modality; covered by spec/0012 (not yet drafted).
Compound indexes ((field_a, field_b) jointly indexed). Single-field for v0; compound is a future addition.
User-defined comparators for string ordering. We use byte-lexical order; locale-aware collation is application-level.

3. Algorithm registry additions

Extends spec/0004 §3.4. New built-in algorithms:

Algorithm ID	Indexed type	Backing structure	Range queries	Equality queries
`dreamdb.btree-int64`	i64 + Timestamp	Sorted (i64, sample-id) leaves	O(log N)	O(log N)
`dreamdb.btree-float64`	f64	Sorted (f64, sample-id) leaves	O(log N)	O(log N)
`dreamdb.btree-string`	UTF-8 string	Sorted (bytes, sample-id) leaves	O(log N) prefix	O(log N)
`dreamdb.bitmap-categorical`	Low-cardinality string	One Roaring bitmap per value	not supported	O(1) per value
`dreamdb.bitmap-bool`	bool	Two Roaring bitmaps (true / false)	n/a	O(1)

Same identifier grammar and registry contract as spec/0004 §3.4. Per-modality registry entries point at the ScalarIndex Object's hash, mirroring the SpatialIndex Object pattern.

4. The ScalarIndex Object

Symmetric with spec/0004 §3's SpatialIndex Object. Carries algorithm parameters; content-addressed; immutable.

4.1 CBOR encoding

{
  "algorithm":  "<id>",          ;; one of the IDs in §3
  "field_name": "<text>",        ;; logical field name (Schema-level)
  "metric":     "<id>",          ;; "ordinal" for B-tree, "categorical" for bitmap
  "params":     <sub-object>,    ;; algorithm-specific (often empty)
}

field_name is not strictly needed for index lookup (the modality string carries the discriminator), but is useful for human readability and for cross-referencing the Dataset's Schema. Address: scalar-index/<multihash-of-canonical-CBOR-bytes>.

4.2 Params

`dreamdb.btree-*`

{
  "version": 1,
  "leaf_fanout":     <uint>,    ;; entries per leaf page (default 4096)
  "internal_fanout": <uint>,    ;; children per internal page (default 256)
}

The B-tree is realized as a chain of Index Pages per spec/0002 §7, parallel to how paged Track indexes work today. Leaf pages hold (field_value, sample_anchor) pairs sorted by field_value then by sample_anchor.

`dreamdb.bitmap-*`

{
  "version": 1,
  "cardinality_limit": <uint>,  ;; max distinct values; default 65536
}

The bitmap index is materialized as one Roaring bitmap per distinct value. Bitmaps are themselves content-addressed objects under scalar-bitmap/<multihash>. The ScalarIndex Object MAY include a small inline value → bitmap-hash map for low-cardinality cases; otherwise the mapping is paged like other big indexes.

5. Track-side layout

A scalar-indexed Track's Object Index is InlineObjectIndex::ScalarBucket(_) or its paged counterpart, parallel to today's SpatialBucket:

rust

pub enum InlineObjectIndex {
    Fragment(Vec<FragmentEntry>),
    SpatialBucket(Vec<SpatialBucketEntry>),
    TimeBatch(Vec<TimeBatchEntry>),
    ScalarBucket(Vec<ScalarBucketEntry>),  // NEW
}

pub struct ScalarBucketEntry {
    /// The field value (CBOR-encoded so heterogeneous concrete types
    /// share one entry shape).
    pub value: ciborium::Value,
    /// Inclusive time-anchor range covered by this bucket.
    pub t_start: u64,
    pub t_end: u64,
    /// The bucket's content hash.
    pub bucket_address: Multihash,
}

Bucket records inside each bucket are (sample_anchor, vector_or_blob_ref) tuples — same as Reference-mode SpatialBuckets — sorted by sample_anchor within a bucket so range scans within a single value are sequential I/O.

For the bitmap variant, "bucket" is misleading — each "bucket address" actually points at the Roaring bitmap's bytes. Query path: load bitmap, AND with whatever other filters' bitmaps are in play, materialize sample anchors.

6. Query semantics

The verb layer adds a new query primitive:

rust

pub async fn query_scalar(
    session: &Session,
    timeline: &Multihash,
    modality: &ModalityTag,
    scalar_index_hash: &Multihash,
    predicate: ScalarPredicate,
    t_start: u64,
    t_end: u64,
) -> Result<Vec<SampleAnchor>, ScalarQueryError>;

ScalarPredicate mirrors the Filter::Where/ScalarOp from dreamdb-dataset:

Eq(v) / Neq(v) — single-value lookup (B-tree: point search; bitmap: load bitmap for value).
Lt(v) / Lte(v) / Gt(v) / Gte(v) — B-tree range scan; bitmap: NotSupported error.
In([v1, v2, ...]) — bitmap: union of per-value bitmaps; B-tree: O(k log N).

t_start/t_end is intersected with the returned anchors at the bucket-entry level (time bounds on each ScalarBucketEntry let us skip whole buckets that fall outside the window).

The Dataset-layer planner uses this to evaluate Filter::Where clauses, then intersects with the vector/time results.

7. Multi-version semantics

DreamDB Tracks are append-only. If a sample's scalar field is "updated" — e.g., label changes from "cat" to "dog" — the new value is written at a fresh time anchor. The scalar index will then contain entries for both the old and new value at different anchors.

Query semantics:

Default (latest-as-of-version): the Manifest's "as-of" view (per spec/0008) determines which Tracks are visible. Within visible Tracks, the most recent entry per sample_anchor wins. This is a per-sample-id "latest" resolution done at read time.
All-versions: an explicit flag include_overwritten: true returns every entry across all versions. Useful for audit / time-travel queries.

Tombstones (sample deletion) are out of scope for v0.1. The natural extension: a sentinel entry with value = null masks all prior entries for that sample.

8. Sizing guidance

Field shape	Cardinality	Algorithm
Categorical labels (class, region, country)	< 10K distinct values	`dreamdb.bitmap-categorical`
Boolean flags (split, is_train)	2	`dreamdb.bitmap-bool`
Numeric scores, timestamps, prices	> 10K distinct values	`dreamdb.btree-{int64,float64}`
Free-text searchable strings	unbounded	`dreamdb.btree-string` (lex order); full-text deferred

ImageNet-100's label (100 distinct values) and split (2 values) both fit the bitmap path cleanly. A future "ResearchDataset" with 10K-class taxonomy would still fit (bitmap-categorical cardinality_limit = 65536 covers it). Anything denser uses B-tree.

The cardinality limit on dreamdb.bitmap-categorical is intentional: above ~64K distinct values, bitmaps stop being more efficient than B-tree leaves. The decoder rejects an index that exceeds the limit; the producer must re-train as B-tree.

9. Implementation roadmap

Mirrors the IVF rollout shape from spec/0004 §5.6:

Step	Crate / file	What it adds
1	`dreamdb-protocol/src/scalar_index.rs`	`ScalarIndexObject` + CBOR + tests, parallel to `SpatialIndexObject`
2	`dreamdb-protocol/src/bitmap_categorical.rs`	First algorithm impl (matches ImageNet-100 `label`); uses Roaring bitmaps
3	`dreamdb-protocol/src/track.rs`	New `InlineObjectIndex::ScalarBucket` variant + paged-tree extension
4	`dreamdb-protocol/src/verbs/query_scalar.rs`	New verb
5	`dreamdb-bench/src/ivf_training.rs` (no — rename to `index_training.rs`)	Bitmap build step; bench validation
6	`dreamdb-dataset/src/filter.rs`	Lift the Phase-1 unsupported-filter gate; plumb `Where` clauses through
7	B-tree algorithms	`dreamdb.btree-int64`, `dreamdb.btree-float64`, `dreamdb.btree-string` (separate PR)

Steps 1-6 are the ImageNet-100 happy path (categorical labels). Step 7 generalizes to numeric/lexical.

10. Open questions

Sample-id model coupling. This spec assumes sample anchors are byte-comparable u64s — DreamDB's existing time-anchor semantics. If design/0001 ever introduces an explicit sample-id join Track (the deferred Phase 1 question), the bucket-record shape needs a per-record sample-id field. Resolution: defer until the join-Track decision is made; bitmaps internally already abstract over the "sample identifier" concept.
Roaring bitmap library choice. Rust ecosystem has roaring (pure Rust, mature) and croaring (C bindings). v0.1 picks roaring; revisit if benchmarks show > 2× perf gap.
Bitmap compression on the wire. Roaring serializes itself; we add nothing on top. Decoder reads the serialized bytes via roaring::RoaringBitmap::deserialize_from. ETag stability across writers requires Roaring's serialization to be byte-deterministic — needs verification (Roaring's standard format is deterministic; double-check during impl).
Bucket address vs inline bitmap. For very small bitmaps (< 1 KB) inlining into the ScalarBucketEntry saves a fetch. Decision: inline when bitmap.serialized_size() < INLINE_THRESHOLD (default 1024 bytes); the ScalarBucketEntry.bucket_address becomes a magic-zero value when inlined and the bytes live in a sibling inline_bytes field. Codified in §5 with a new variant.

11. Conformance

Adds a new test category to spec/0009 §5:

5.4. Scalar Index — ScalarIndex Object CBOR round-trip per algorithm; bucket-record sort order; bitmap union/intersection determinism; B-tree leaf-page paging at boundary sizes; multi-version latest-wins semantics.

Cross-implementation test vectors (the per-algorithm fixtures) live in dreamdb-conformance/fixtures/scalar/. Producers MUST hash-match every fixture's outputs (Index Object, bucket entries, returned anchors) before the implementation is certified.