Spec 0011 — Scalar Indexing (Native Structured-Metadata Filters)
Status: Draft (Phase 2 design).
Depends on: spec/0001, spec/0002, spec/0004, spec/0007.
Motivation: design/0001-dataset-platform.md — ML training pipelines need to filter samples by structured metadata (WHERE label='cat') alongside vector similarity and time range. DreamDB v0 has the latter two; scalar filtering is the missing piece. This spec defines the modality that closes the gap.
1. Purpose
The vector- and time-filter paths in v0 (spec/0004 + time-bucket addresses) both rely on a content-addressed index that lives alongside the bucket data: vector queries hit a SpatialIndex Object (LSH/IVF/IMI centroids); time queries descend the Track's paged time-bucket index.
For structured fields (label = 'cat', region IN ('us', 'eu'), confidence > 0.9, etc.) DreamDB has no analogous index today. The naïve fallback is full-scan over a Discrete Event Track, which works to ~10M rows and breaks beyond. This spec defines the index objects, Track-side bucket layout, and query semantics that make scalar filters scale-invariant in the same sense vector filters are: per-query work is bounded by the result-set size, not the corpus size.
2. Scope
In scope:
- A new family of algorithm IDs under
dreamdb.btree-*anddreamdb.bitmap-*. - A new
InlineObjectIndex::ScalarBucketvariant for Track Objects of scalar modalities. - A new bucket layout for the data side (sorted (value, sample-ref) pairs for B-tree; sample-id bitsets per value for bitmap).
- Multi-version semantics: how the index handles overwrites in an append-only timeline.
- Sizing guidance (when to pick bitmap vs B-tree, how to scale across N).
Out of scope (deferred to future revisions):
- Full-text / token indexes. Documents under
text.utf8Tracks would need their own inverted-index modality; covered byspec/0012(not yet drafted). - Compound indexes (
(field_a, field_b)jointly indexed). Single-field for v0; compound is a future addition. - User-defined comparators for
stringordering. We use byte-lexical order; locale-aware collation is application-level.
3. Algorithm registry additions
Extends spec/0004 §3.4. New built-in algorithms:
| Algorithm ID | Indexed type | Backing structure | Range queries | Equality queries |
|---|---|---|---|---|
dreamdb.btree-int64 | i64 + Timestamp | Sorted (i64, sample-id) leaves | O(log N) | O(log N) |
dreamdb.btree-float64 | f64 | Sorted (f64, sample-id) leaves | O(log N) | O(log N) |
dreamdb.btree-string | UTF-8 string | Sorted (bytes, sample-id) leaves | O(log N) prefix | O(log N) |
dreamdb.bitmap-categorical | Low-cardinality string | One Roaring bitmap per value | not supported | O(1) per value |
dreamdb.bitmap-bool | bool | Two Roaring bitmaps (true / false) | n/a | O(1) |
Same identifier grammar and registry contract as spec/0004 §3.4. Per-modality registry entries point at the ScalarIndex Object's hash, mirroring the SpatialIndex Object pattern.
4. The ScalarIndex Object
Symmetric with spec/0004 §3's SpatialIndex Object. Carries algorithm parameters; content-addressed; immutable.
4.1 CBOR encoding
field_name is not strictly needed for index lookup (the modality string carries the discriminator), but is useful for human readability and for cross-referencing the Dataset's Schema. Address: scalar-index/<multihash-of-canonical-CBOR-bytes>.
4.2 Params
dreamdb.btree-*
The B-tree is realized as a chain of Index Pages per spec/0002 §7, parallel to how paged Track indexes work today. Leaf pages hold (field_value, sample_anchor) pairs sorted by field_value then by sample_anchor.
dreamdb.bitmap-*
The bitmap index is materialized as one Roaring bitmap per distinct value. Bitmaps are themselves content-addressed objects under scalar-bitmap/<multihash>. The ScalarIndex Object MAY include a small inline value → bitmap-hash map for low-cardinality cases; otherwise the mapping is paged like other big indexes.
5. Track-side layout
A scalar-indexed Track's Object Index is InlineObjectIndex::ScalarBucket(_) or its paged counterpart, parallel to today's SpatialBucket:
Bucket records inside each bucket are (sample_anchor, vector_or_blob_ref) tuples — same as Reference-mode SpatialBuckets — sorted by sample_anchor within a bucket so range scans within a single value are sequential I/O.
For the bitmap variant, "bucket" is misleading — each "bucket address" actually points at the Roaring bitmap's bytes. Query path: load bitmap, AND with whatever other filters' bitmaps are in play, materialize sample anchors.
6. Query semantics
The verb layer adds a new query primitive:
ScalarPredicate mirrors the Filter::Where/ScalarOp from dreamdb-dataset:
Eq(v) / Neq(v)— single-value lookup (B-tree: point search; bitmap: load bitmap for value).Lt(v) / Lte(v) / Gt(v) / Gte(v)— B-tree range scan; bitmap: NotSupported error.In([v1, v2, ...])— bitmap: union of per-value bitmaps; B-tree: O(k log N).
t_start/t_end is intersected with the returned anchors at the bucket-entry level (time bounds on each ScalarBucketEntry let us skip whole buckets that fall outside the window).
The Dataset-layer planner uses this to evaluate Filter::Where clauses, then intersects with the vector/time results.
7. Multi-version semantics
DreamDB Tracks are append-only. If a sample's scalar field is "updated" — e.g., label changes from "cat" to "dog" — the new value is written at a fresh time anchor. The scalar index will then contain entries for both the old and new value at different anchors.
Query semantics:
- Default (latest-as-of-version): the Manifest's "as-of" view (per
spec/0008) determines which Tracks are visible. Within visible Tracks, the most recent entry persample_anchorwins. This is a per-sample-id "latest" resolution done at read time. - All-versions: an explicit flag
include_overwritten: truereturns every entry across all versions. Useful for audit / time-travel queries.
Tombstones (sample deletion) are out of scope for v0.1. The natural extension: a sentinel entry with value = null masks all prior entries for that sample.
8. Sizing guidance
| Field shape | Cardinality | Algorithm |
|---|---|---|
| Categorical labels (class, region, country) | < 10K distinct values | dreamdb.bitmap-categorical |
| Boolean flags (split, is_train) | 2 | dreamdb.bitmap-bool |
| Numeric scores, timestamps, prices | > 10K distinct values | dreamdb.btree-{int64,float64} |
| Free-text searchable strings | unbounded | dreamdb.btree-string (lex order); full-text deferred |
ImageNet-100's label (100 distinct values) and split (2 values) both fit the bitmap path cleanly. A future "ResearchDataset" with 10K-class taxonomy would still fit (bitmap-categorical cardinality_limit = 65536 covers it). Anything denser uses B-tree.
The cardinality limit on dreamdb.bitmap-categorical is intentional: above ~64K distinct values, bitmaps stop being more efficient than B-tree leaves. The decoder rejects an index that exceeds the limit; the producer must re-train as B-tree.
9. Implementation roadmap
Mirrors the IVF rollout shape from spec/0004 §5.6:
| Step | Crate / file | What it adds |
|---|---|---|
| 1 | dreamdb-protocol/src/scalar_index.rs | ScalarIndexObject + CBOR + tests, parallel to SpatialIndexObject |
| 2 | dreamdb-protocol/src/bitmap_categorical.rs | First algorithm impl (matches ImageNet-100 label); uses Roaring bitmaps |
| 3 | dreamdb-protocol/src/track.rs | New InlineObjectIndex::ScalarBucket variant + paged-tree extension |
| 4 | dreamdb-protocol/src/verbs/query_scalar.rs | New verb |
| 5 | dreamdb-bench/src/ivf_training.rs (no — rename to index_training.rs) | Bitmap build step; bench validation |
| 6 | dreamdb-dataset/src/filter.rs | Lift the Phase-1 unsupported-filter gate; plumb Where clauses through |
| 7 | B-tree algorithms | dreamdb.btree-int64, dreamdb.btree-float64, dreamdb.btree-string (separate PR) |
Steps 1-6 are the ImageNet-100 happy path (categorical labels). Step 7 generalizes to numeric/lexical.
10. Open questions
- Sample-id model coupling. This spec assumes sample anchors are byte-comparable u64s — DreamDB's existing time-anchor semantics. If
design/0001ever introduces an explicit sample-id join Track (the deferred Phase 1 question), the bucket-record shape needs a per-record sample-id field. Resolution: defer until the join-Track decision is made; bitmaps internally already abstract over the "sample identifier" concept. - Roaring bitmap library choice. Rust ecosystem has
roaring(pure Rust, mature) andcroaring(C bindings). v0.1 picksroaring; revisit if benchmarks show > 2× perf gap. - Bitmap compression on the wire. Roaring serializes itself; we add nothing on top. Decoder reads the serialized bytes via
roaring::RoaringBitmap::deserialize_from. ETag stability across writers requires Roaring's serialization to be byte-deterministic — needs verification (Roaring's standard format is deterministic; double-check during impl). - Bucket address vs inline bitmap. For very small bitmaps (< 1 KB) inlining into the ScalarBucketEntry saves a fetch. Decision: inline when
bitmap.serialized_size() < INLINE_THRESHOLD(default 1024 bytes); theScalarBucketEntry.bucket_addressbecomes a magic-zero value when inlined and the bytes live in a siblinginline_bytesfield. Codified in §5 with a new variant.
11. Conformance
Adds a new test category to spec/0009 §5:
- 5.4. Scalar Index — ScalarIndex Object CBOR round-trip per algorithm; bucket-record sort order; bitmap union/intersection determinism; B-tree leaf-page paging at boundary sizes; multi-version latest-wins semantics.
Cross-implementation test vectors (the per-algorithm fixtures) live in dreamdb-conformance/fixtures/scalar/. Producers MUST hash-match every fixture's outputs (Index Object, bucket entries, returned anchors) before the implementation is certified.