DreamDB v0.2.0 bec026

DreamDB Specification — 0021: Compaction

Status: Draft, 2026-05-22. Builds on 0001, 0006, 0007 §6.6, 0008 §6.

1. Purpose

After the LSM retrofit (0007 §6.6), Dataset::append writes one new SpatialBucket per touched cell with only the records from this batch. The Track Object's object_index accumulates F SpatialBucketEntry entries per <spatial-key> over time. Reads union all entries for a cell, so query latency scales linearly with F.

This document defines the compaction protocol: how an operator consolidates F → 1 per cell, the safety properties readers and writers rely on during compaction, and the conformance requirements compactors MUST satisfy.

2. When compaction runs

Compaction is operator-driven. DreamDB SDKs MUST NOT auto-trigger compaction. Operator-driven means:

  • Triggered by an explicit CLI invocation (dreamdb-cli compact) or external orchestration (k8s CronJob, monitoring-based trigger).
  • The SDK never spawns background threads, daemons, or implicit compaction workers as part of normal append/iter operations.

Rationale: in-SDK background compaction adds failure modes (resource contention with the application, partial-flush races, debugging difficulty) that are inappropriate for a v1 protocol. Operator-driven compaction matches the model already established by ada-ivf-step (per 0008 §6 + project_rebuild_concurrency_rules).

Recommended operator cadences:

Workload                                     | Cadence
Bulk load, then quiet                        | One compaction after ingest completes
Light streaming (<100 rec/s, <100M dataset)  | Hourly k8s CronJob
Heavy streaming (>1K rec/s)                  | Continuous worker pool (parallel shards)
Read-only archive                            | Never

3. Compaction operations

A conformant compactor MUST perform the following steps atomically (one Manifest published at the end, single Ref CAS):

3.1 Identify candidate cells

  1. Read the Ref to obtain the current Manifest hash + ETag.
  2. Read the Manifest; find the target modality's TrackEntry.
  3. Walk the Track Object's object_index:
    • Inline SpatialBucket index: enumerate SpatialBucketEntry directly.
    • Paged SpatialBucket index: walk the B-tree (per 0007 §7.3.2).
  4. Group entries by spatial_key. Per-cell fragment count F = entries_for_cell.len().
  5. Select cells where F > threshold (operator-supplied, default 1).
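
The grouping and selection in steps 4 and 5 can be sketched as follows. This is a minimal illustration in Python, not the SDK's implementation: the SpatialBucketEntry tuple and the select_candidates helper are hypothetical names, and the entries are assumed to have already been enumerated from the inline or paged index.

```python
from collections import defaultdict
from typing import NamedTuple


class SpatialBucketEntry(NamedTuple):
    # Hypothetical in-memory shape; only the fields this step needs.
    spatial_key: str
    t_start: int
    t_end: int
    bucket_address: str


def select_candidates(entries, threshold=1):
    """Group entries by spatial_key; keep cells whose fragment count F > threshold."""
    by_cell = defaultdict(list)
    for entry in entries:
        by_cell[entry.spatial_key].append(entry)
    return {key: frags for key, frags in by_cell.items() if len(frags) > threshold}


index = [
    SpatialBucketEntry("cell-a", 0, 10, "hash-1"),
    SpatialBucketEntry("cell-a", 10, 20, "hash-2"),
    SpatialBucketEntry("cell-b", 0, 5, "hash-3"),
]
candidates = select_candidates(index)
print(sorted(candidates))  # ['cell-a']
```

With the default threshold of 1, any cell holding more than one fragment is selected; cell-b has F = 1 and is left alone.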

3.2 Per-cell merge

For each selected cell:

  1. Fetch all fragments concurrently (GET each SpatialBucket Object by content hash).
  2. Validate compatible headers. All fragments MUST share modality, record_size, spatial_index_hash, and vector_compressor_hash. If any differ, compaction MUST fail loudly with a clear error identifying the affected cell. Mismatched headers indicate the operator must use a feature-branch reindex (per 0008 §6) rather than a compaction.
  3. Union records by time_anchor. When two fragments contain the same time_anchor:
    • If the record bytes match exactly → keep one (deduplication).
    • If the record bytes differ → compaction MUST fail loudly. Different vectors at the same anchor indicate two writers ingested the same logical Item (slice-assignment bug per 0008 §5); compaction MUST NOT silently choose one.
  4. Sort the merged records by time_anchor.
  5. Encode + PUT one consolidated SpatialBucket Object (content-addressed).
  6. Emit one replacement SpatialBucketEntry with t_start = min(time_anchor) and t_end = max(time_anchor) + 1 (both taken over the merged records), byte_size = consolidated_byte_size, and bucket_address = consolidated_hash. Set rerank_storage_hash as described in §5; rerank-VS consolidation follows the same pattern as this merge.
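
The union, deduplication, and conflict rules of steps 3, 4, and 6 can be sketched as below. This is an illustration under assumptions: fragments are modelled as already-decoded lists of (time_anchor, record_bytes) pairs, and CompactionError, merge_cell, and entry_bounds are hypothetical names. Header validation, encoding, and the PUT are omitted.

```python
class CompactionError(Exception):
    """Raised when a cell cannot be merged safely."""


def merge_cell(spatial_key, fragments):
    """Union records by time_anchor; fail on conflicting bytes at the same anchor.

    fragments: list of fragments, each a list of (time_anchor, record_bytes).
    Returns the merged records sorted by time_anchor.
    """
    merged = {}
    for fragment in fragments:
        for anchor, record in fragment:
            if anchor not in merged:
                merged[anchor] = record
            elif merged[anchor] != record:
                # Never pick a winner silently; name the affected cell.
                raise CompactionError(
                    f"cell {spatial_key!r}: conflicting records at time_anchor {anchor}"
                )
            # Identical bytes are a duplicate: keep the copy already stored.
    return sorted(merged.items())


def entry_bounds(records):
    """t_start is the smallest anchor; t_end is one past the largest."""
    anchors = [anchor for anchor, _ in records]
    return min(anchors), max(anchors) + 1


records = merge_cell("cell-a", [[(5, b"x"), (1, b"y")], [(5, b"x"), (9, b"z")]])
print(records)                # [(1, b'y'), (5, b'x'), (9, b'z')]
print(entry_bounds(records))  # (1, 10)
```

The duplicate at anchor 5 is kept once because its bytes match. Had the second fragment carried different bytes at anchor 5, merge_cell would raise instead of choosing one.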

3.3 Publish

  1. Build the new Track Object: keep entries NOT in the compacted set; append the replacements; sort by (spatial_key, t_start).
  2. PUT the new Track Object.
  3. PUT a new Manifest with the updated Track address; registry unchanged (SI/VC unchanged).
  4. CAS-advance the Ref using the ETag captured in §3.1 step 1.
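
Step 1 of the publish sequence can be sketched as below. Entries are modelled as plain dicts, and rebuild_index is a hypothetical helper. The sketch assumes every fragment of a selected cell was merged, so dropping entries by spatial_key is equivalent to dropping the compacted set.

```python
def rebuild_index(entries, replacements):
    """Drop entries of compacted cells, add replacements, sort by (spatial_key, t_start)."""
    compacted_cells = {r["spatial_key"] for r in replacements}
    kept = [e for e in entries if e["spatial_key"] not in compacted_cells]
    return sorted(kept + replacements, key=lambda e: (e["spatial_key"], e["t_start"]))


old_index = [
    {"spatial_key": "cell-b", "t_start": 0, "bucket_address": "hash-3"},
    {"spatial_key": "cell-a", "t_start": 10, "bucket_address": "hash-2"},
    {"spatial_key": "cell-a", "t_start": 0, "bucket_address": "hash-1"},
]
replacements = [{"spatial_key": "cell-a", "t_start": 0, "bucket_address": "hash-merged"}]

new_index = rebuild_index(old_index, replacements)
print([e["bucket_address"] for e in new_index])  # ['hash-merged', 'hash-3']
```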

3.4 CAS conflict handling

If the Ref CAS fails (a concurrent writer landed during the compaction):

  • Compaction MUST fail loudly with an error directing the operator to use a feature branch (per 0008 §6) or to retry.
  • The Manifest + new Bucket Objects from this compaction remain on S3 (orphaned, GC-reclaimable via dreamdb-cli gc).
  • The Ref is never left at a partial state: it still points at either the prior consolidated state or the concurrent writer's new tip.
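
The conflict path can be illustrated with a toy in-memory Ref. This is a sketch under assumptions, not the storage backend's API: InMemoryRef, CasConflict, and the integer ETag are hypothetical stand-ins for whatever conditional-write primitive the object store provides.

```python
class CasConflict(Exception):
    """Raised when the Ref moved since the ETag was captured."""


class InMemoryRef:
    """Toy stand-in for the Ref; the integer ETag is an illustration only."""

    def __init__(self, manifest_hash):
        self.manifest_hash = manifest_hash
        self.etag = 0

    def read(self):
        return self.manifest_hash, self.etag

    def compare_and_swap(self, expected_etag, new_manifest_hash):
        if self.etag != expected_etag:
            raise CasConflict(
                "Ref advanced during compaction; retry or use a feature branch"
            )
        self.manifest_hash = new_manifest_hash
        self.etag += 1


ref = InMemoryRef("manifest-0")
_, compactor_etag = ref.read()  # compactor captures the ETag (§3.1 step 1)
_, writer_etag = ref.read()
ref.compare_and_swap(writer_etag, "manifest-writer")  # concurrent append lands first

try:
    ref.compare_and_swap(compactor_etag, "manifest-compacted")
    outcome = "published"
except CasConflict:
    outcome = "conflict"

print(outcome, ref.manifest_hash)  # conflict manifest-writer
```

The compactor's objects were already written before the failed CAS, which is why they end up orphaned rather than rolled back.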

4. Read-online property (mandatory)

During compaction:

  • Queries MUST continue to hit the old Manifest (the one pinned by the Ref) until the atomic CAS advances the Ref. There is no window in which queries slow down.
  • After the CAS, new queries see the new Manifest immediately. Buckets reference content hashes; the new Bucket Objects are addressable as soon as their PUTs complete.
  • The old Bucket Objects remain addressable on S3 (immutable, content-addressed). They become unreferenced once the new Manifest is the Ref's target, but content-addressing means in-flight queries against the OLD Manifest still resolve correctly until those queries complete.

This is the same property as the rebuild-concurrency rules pinned in 0008 §6 — "read-online, write-needs-branch."
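
The property can be demonstrated with a toy content-addressed store. The sketch below makes simplifying assumptions: a Manifest is modelled as a bare list of bucket hashes, the Ref as a dict, and ObjectStore is a hypothetical class. It shows only that a reader pinned to the old Manifest still resolves every object after the Ref advances.

```python
import hashlib


class ObjectStore:
    """Toy immutable store: objects are keyed by the SHA-256 of their bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data):
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address):
        return self._objects[address]


store = ObjectStore()
old_manifest = [store.put(b"frag-1"), store.put(b"frag-2")]
ref = {"manifest": old_manifest}

pinned = ref["manifest"]  # an in-flight query pins the old Manifest

# Compaction writes the consolidated bucket, then advances the Ref.
new_manifest = [store.put(b"frag-1frag-2")]
ref["manifest"] = new_manifest

old_view = b"".join(store.get(h) for h in pinned)
new_view = b"".join(store.get(h) for h in ref["manifest"])
print(old_view == new_view)  # True
```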

5. Rerank VectorStorage consolidation

For modalities with rerank=true, each SpatialBucketEntry carries a rerank_storage_hash pointing at a parallel VectorStorage Object holding raw f32 vectors mirroring the bucket's record order (per 0010 §8). When compacting fragments:

  • Fetch all per-fragment VectorStorage Objects.
  • Union records in the same order as the consolidated bucket's records.
  • Encode + PUT one consolidated VectorStorage Object.
  • The replacement SpatialBucketEntry.rerank_storage_hash MUST point at the consolidated VS.

v1 implementations MAY skip rerank-VS consolidation (carry over the first fragment's hash) at the cost of read-time rerank inaccuracy on consolidated cells. Operators SHOULD prefer full consolidation when rerank is enabled.
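
Full consolidation can be sketched as below. The sketch assumes each fragment's VectorStorage has been decoded into a list of vectors parallel to that fragment's time anchors, and that records which deduplicated in the bucket merge also carry identical raw vectors. consolidate_vectors is a hypothetical helper.

```python
def consolidate_vectors(fragments, consolidated_anchors):
    """Reorder per-fragment vectors to mirror the consolidated bucket's record order.

    fragments: list of (anchors, vectors) pairs where vectors[i] belongs to anchors[i].
    consolidated_anchors: time anchors of the merged bucket, in its final order.
    """
    by_anchor = {}
    for anchors, vectors in fragments:
        if len(anchors) != len(vectors):
            raise ValueError("VectorStorage does not mirror its bucket's record count")
        for anchor, vector in zip(anchors, vectors):
            # Assumption: duplicates carry the same vector, so the first copy wins.
            by_anchor.setdefault(anchor, vector)
    return [by_anchor[anchor] for anchor in consolidated_anchors]


fragments = [
    ([5, 1], [[0.5, 0.5], [0.1, 0.1]]),
    ([5, 9], [[0.5, 0.5], [0.9, 0.9]]),
]
print(consolidate_vectors(fragments, [1, 5, 9]))
# [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
```

Carrying over only the first fragment's hash would leave anchor 9 without a matching raw vector in this example, which is the rerank inaccuracy noted above.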

6. Sharded compaction (optional)

A conformant compactor MAY split work across multiple workers for billion-scale compaction:

  • Worker phase (--shard N --of M): each worker handles cells where hash(spatial_key) % M == N. Each worker writes a per-shard JSON output listing its compacted entries.
  • Orchestrator phase (--orchestrate --job-id ID): one orchestrator reads all M shard outputs, builds the new Track + Manifest from the union, CAS-advances the Ref.

Worker outputs MUST be content-addressed (idempotent re-runs are safe). Failed workers MUST be re-runnable without corrupting other workers' state.
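
Shard assignment can be sketched as below. The hash function is not specified here, so truncated SHA-256 is an assumption chosen only because it is stable across processes and machines (Python's built-in hash() is salted per process for strings and would not be). shard_of and cells_for_worker are hypothetical names.

```python
import hashlib


def shard_of(spatial_key, num_shards):
    """Map a spatial_key to a shard index in [0, num_shards) with a stable hash."""
    digest = hashlib.sha256(spatial_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def cells_for_worker(spatial_keys, shard, num_shards):
    """Cells handled by worker `shard` of `num_shards` (--shard N --of M)."""
    return [key for key in spatial_keys if shard_of(key, num_shards) == shard]


keys = [f"cell-{i}" for i in range(20)]
assignments = [cells_for_worker(keys, n, 4) for n in range(4)]

# Every cell lands in exactly one shard, so the worker outputs partition the job.
print(sum(len(cells) for cells in assignments) == len(keys))  # True
```

Because the assignment depends only on the key and M, a failed worker can be re-run and will select the same cells.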

7. Idempotence

A compactor running on an already-consolidated dataset (all F ≤ threshold) MUST be a no-op:

  • No new Bucket Objects, no new Track Object, no new Manifest.
  • The Ref MUST remain at its current state.
  • Exit code MUST indicate success (no-op is success, not failure).

This allows operators to run compaction defensively — e.g., before a snapshot, after a rebuild — without worrying about wasted work.
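
The no-op requirement can be sketched as an early return taken before any write. The compact function, the put_object callback, and the use of exit code 0 for success are illustrative assumptions, and the object names are placeholders.

```python
def compact(fragment_counts, threshold, put_object):
    """fragment_counts maps spatial_key -> F. Returns a process exit code."""
    selected = sorted(k for k, f in fragment_counts.items() if f > threshold)
    if not selected:
        # Already consolidated: write nothing, leave the Ref alone, report success.
        return 0
    for key in selected:
        put_object(f"bucket:{key}")
    put_object("track")
    put_object("manifest")
    return 0


writes = []
exit_code = compact({"cell-a": 1, "cell-b": 1}, 1, writes.append)
print(exit_code, writes)  # 0 []
```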

8. Conformance test vectors (deferred to 0009)

The conformance suite at 0009 §11 (added in this revision) carries:

  1. Multi-bucket reads — append N batches into k cells, query, verify top-K matches brute force.
  2. Compact idempotence — compact a consolidated dataset, assert no-op.
  3. Compact correctness — compact a fragmented dataset, assert queries return the same anchors as pre-compact.
  4. Lineage refusal — compact across an SI hash change, assert compaction fails with the documented error.
  5. Anchor conflict refusal — compact two fragments with the same time_anchor but different vectors, assert compaction fails with the documented error.
  6. Read-online property — start a query against the OLD Manifest while compaction runs, assert it completes correctly.

9. Open questions

OQ     | Description
OQ-44  | Should compaction support a --rebuild-rerank flag for rerank-VS consolidation when the v1 carry-over-first-hash behavior is in use?
OQ-45  | Should sharded compaction emit a single content-addressed orchestration manifest that workers register into, vs. per-shard JSON outputs? Decided per ada-ivf-step precedent (per-shard JSON, simpler) but worth revisiting at billion scale.
OQ-46  | Should the SDK expose a Dataset::observe_fragments() API that reports per-cell F for operator monitoring? Useful for "alert when F > 50" pipelines; pure observability, no mutation.

Resolutions for OQ-44/-45/-46 land in a follow-up revision once empirical operator feedback exists.