DreamDB Specification — 0021: Compaction
Status: Draft, 2026-05-22. Builds on 0001, 0006, 0007 §6.6, 0008 §6.
1. Purpose
After the LSM retrofit (0007 §6.6), Dataset::append writes one new SpatialBucket per touched cell with only the records from this batch. The Track Object's object_index accumulates F SpatialBucketEntry entries per <spatial-key> over time. Reads union all entries for a cell, so query latency scales linearly with F.
This document defines the compaction protocol: how an operator consolidates F → 1 per cell, the safety properties readers and writers rely on during compaction, and the conformance requirements compactors MUST satisfy.
2. When compaction runs
Compaction is operator-driven. DreamDB SDKs MUST NOT auto-trigger compaction. Operator-driven means:
- Triggered by an explicit CLI invocation (dreamdb-cli compact) or external orchestration (k8s CronJob, monitoring-based trigger).
- The SDK never spawns background threads, daemons, or implicit compaction workers as part of normal append/iter operations.
Rationale: in-SDK background compaction adds failure modes (resource contention with the application, partial-flush races, debugging difficulty) that are inappropriate for a v1 protocol. Operator-driven compaction matches the model already established by ada-ivf-step (per 0008 §6 + project_rebuild_concurrency_rules).
Recommended operator cadences:
| Workload | Cadence |
|---|---|
| Bulk load, then quiet | One compaction after ingest completes |
| Light streaming (<100 rec/s, <100M dataset) | Hourly k8s CronJob |
| Heavy streaming (>1K rec/s) | Continuous worker pool (parallel shards) |
| Read-only archive | Never |
3. Compaction operations
A conformant compactor MUST perform the following steps atomically (one Manifest published at the end, single Ref CAS):
3.1 Identify candidate cells
- Read the Ref to obtain the current Manifest hash + ETag.
- Read the Manifest; find the target modality's TrackEntry.
- Walk the Track Object's object_index:
  - Inline SpatialBucket index: enumerate SpatialBucketEntry directly.
  - Paged SpatialBucket index: walk the B-tree (per 0007 §7.3.2).
- Group entries by spatial_key. Per-cell fragment count F = entries_for_cell.len().
- Select cells where F > threshold (operator-supplied, default 1).
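The grouping and selection steps above can be sketched as follows. This is a minimal illustration, not the SDK's implementation: the `SpatialBucketEntry` shape is simplified, `select_candidates` is a hypothetical helper name, and the index is assumed to be already enumerated into a flat list (inline or paged).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialBucketEntry:
    # Hypothetical, simplified shape; the real entry carries more fields.
    spatial_key: str
    t_start: int
    t_end: int
    bucket_address: str


def select_candidates(object_index, threshold=1):
    """Group entries by spatial_key and keep only cells with F > threshold."""
    by_cell = defaultdict(list)
    for entry in object_index:
        by_cell[entry.spatial_key].append(entry)
    return {key: entries for key, entries in by_cell.items()
            if len(entries) > threshold}


index = [
    SpatialBucketEntry("cell-a", 0, 10, "h1"),
    SpatialBucketEntry("cell-a", 10, 20, "h2"),
    SpatialBucketEntry("cell-b", 0, 5, "h3"),
]
candidates = select_candidates(index)  # only "cell-a" has F = 2 > 1
```

With the default threshold of 1, any cell holding more than one fragment is selected; raising the threshold trades read latency for fewer rewrites.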
3.2 Per-cell merge
For each selected cell:
- Fetch all fragments concurrently (GET each SpatialBucket Object by content hash).
- Validate compatible headers. All fragments MUST share modality, record_size, spatial_index_hash, and vector_compressor_hash. If any differ, compaction MUST fail loudly with a clear error identifying the affected cell. Mismatched headers indicate the operator must use a feature-branch reindex (per 0008 §6) rather than a compaction.
- Union records by time_anchor. When two fragments contain the same time_anchor:
  - If the record bytes match exactly → keep one (deduplication).
  - If the record bytes differ → compaction MUST fail loudly. Different vectors at the same anchor indicate two writers ingested the same logical Item (slice-assignment bug per 0008 §5); compaction MUST NOT silently choose one.
- Sort the merged records by time_anchor.
- Encode + PUT one consolidated SpatialBucket Object (content-addressed).
- Emit one replacement SpatialBucketEntry whose t_start = min(records), t_end = max(records) + 1, byte_size = consolidated_byte_size, bucket_address = consolidated_hash. Carry over rerank_storage_hash (the rerank-VS consolidation is symmetric and follows the same pattern; see §5).
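The merge rules above can be sketched as below. Everything here is illustrative: `Fragment`, `CompactionError`, and `merge_cell` are hypothetical names, the "encoding" is a plain concatenation of record bytes, and SHA-256 stands in for whatever content hash the format actually uses. The sketch also assumes the cell holds at least one record.

```python
import hashlib
from dataclasses import dataclass


class CompactionError(Exception):
    """Raised when a cell cannot be merged safely."""


@dataclass(frozen=True)
class Fragment:
    # Hypothetical in-memory view of one decoded SpatialBucket Object.
    header: tuple  # (modality, record_size, spatial_index_hash, vector_compressor_hash)
    records: dict  # time_anchor -> record bytes


def merge_cell(spatial_key, fragments):
    """Return (replacement_entry, ordered_records) or raise CompactionError."""
    if len({f.header for f in fragments}) != 1:
        raise CompactionError(f"header mismatch in cell {spatial_key!r}")
    merged = {}
    for frag in fragments:
        for anchor, payload in frag.records.items():
            # Identical bytes deduplicate; differing bytes are a hard failure.
            if anchor in merged and merged[anchor] != payload:
                raise CompactionError(
                    f"conflicting records at time_anchor {anchor} "
                    f"in cell {spatial_key!r}")
            merged[anchor] = payload
    ordered = sorted(merged.items())  # sorted by time_anchor
    body = b"".join(payload for _, payload in ordered)  # placeholder encoding
    entry = {
        "spatial_key": spatial_key,
        "t_start": ordered[0][0],
        "t_end": ordered[-1][0] + 1,
        "byte_size": len(body),
        "bucket_address": hashlib.sha256(body).hexdigest(),
    }
    return entry, ordered
```

Raising on a byte mismatch rather than picking a winner keeps the failure visible to the operator, which is what the MUST NOT in the conflict rule asks for.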
3.3 Publish
- Build the new Track Object: keep entries NOT in the compacted set; append the replacements; sort by (spatial_key, t_start).
- PUT the new Track Object.
- PUT a new Manifest with the updated Track address; registry unchanged (SI/VC unchanged).
- CAS-advance the Ref using the ETag captured in §3.1 step 1.
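As a rough illustration of the first publish step, the new index is the old index minus every entry of a compacted cell, plus the replacements, re-sorted. The dict-based entry shape and the `rebuild_index` name are assumptions made for this sketch only.

```python
def rebuild_index(old_entries, compacted_keys, replacements):
    """Drop entries of compacted cells, add replacements, sort by (spatial_key, t_start)."""
    kept = [e for e in old_entries if e["spatial_key"] not in compacted_keys]
    return sorted(kept + replacements,
                  key=lambda e: (e["spatial_key"], e["t_start"]))


old = [
    {"spatial_key": "cell-b", "t_start": 0, "bucket_address": "h3"},
    {"spatial_key": "cell-a", "t_start": 10, "bucket_address": "h2"},
    {"spatial_key": "cell-a", "t_start": 0, "bucket_address": "h1"},
]
replacement = {"spatial_key": "cell-a", "t_start": 0, "bucket_address": "h-merged"}
new_index = rebuild_index(old, {"cell-a"}, [replacement])
```

Entries for cells that were not selected pass through untouched, so their bucket addresses remain valid in the new Track Object.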
3.4 CAS conflict handling
If the Ref CAS fails (a concurrent writer landed during the compaction):
- Compaction MUST fail loudly with an error directing the operator to use a feature branch (per 0008 §6) or to retry.
- The Manifest + new Bucket Objects from this compaction remain on S3 (orphaned, GC-reclaimable via dreamdb-cli gc).
- The Ref still points at the prior consolidated state OR the writer's new tip.
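The conflict case can be modeled with an in-memory stand-in for the Ref. `InMemoryRef` and `CasConflict` are hypothetical names, and the ETag is reduced to a counter; a real object store would supply its own ETag and conditional-write mechanism.

```python
class CasConflict(Exception):
    """Raised when the Ref moved since the ETag was captured."""


class InMemoryRef:
    # Hypothetical stand-in for the Ref; the ETag is modeled as a counter.
    def __init__(self, manifest_hash):
        self.manifest_hash = manifest_hash
        self.etag = 0

    def read(self):
        return self.manifest_hash, self.etag

    def compare_and_swap(self, expected_etag, new_manifest_hash):
        if self.etag != expected_etag:
            raise CasConflict("Ref advanced concurrently; retry or use a feature branch")
        self.manifest_hash = new_manifest_hash
        self.etag += 1


ref = InMemoryRef("manifest-0")
_, etag = ref.read()                           # compactor captures the ETag up front
ref.compare_and_swap(etag, "manifest-writer")  # a concurrent writer lands first
try:
    ref.compare_and_swap(etag, "manifest-compacted")
    outcome = "published"
except CasConflict:
    outcome = "conflict"                       # compaction output stays orphaned
```

In this run the compactor's swap is rejected and the Ref keeps pointing at the writer's tip, so no append is lost.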
4. Read-online property (mandatory)
During compaction:
- Queries MUST continue to hit the old Manifest (the one pinned by the Ref) until the atomic CAS advances the Ref. There is no window in which queries slow down.
- After the CAS, new queries see the new Manifest immediately. Buckets reference content hashes; the new Bucket Objects are addressable as soon as their PUTs complete.
- The old Bucket Objects remain addressable on S3 (immutable, content-addressed). They become unreferenced once the new Manifest is the Ref's target, but content-addressing means in-flight queries against the OLD Manifest still resolve correctly until those queries complete.
This is the same property as the rebuild-concurrency rules pinned in 0008 §6 — "read-online, write-needs-branch."
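A toy model shows why the property holds: a query that pinned the old Manifest keeps resolving the old content hashes even after the Ref moves. The dict-based store, the `put` helper, and SHA-256 addressing are assumptions for this sketch, not the on-disk format.

```python
import hashlib

store = {}  # content hash -> bytes; objects are never overwritten in this model


def put(blob):
    """Store a blob under its content hash and return the address."""
    address = hashlib.sha256(blob).hexdigest()
    store[address] = blob
    return address


old_buckets = [put(b"frag-1"), put(b"frag-2")]
ref = {"manifest": {"buckets": old_buckets}}

pinned = ref["manifest"]                     # an in-flight query pins the old Manifest

new_bucket = put(b"frag-1frag-2")            # compactor PUTs the consolidated object
ref["manifest"] = {"buckets": [new_bucket]}  # then advances the Ref

old_view = b"".join(store[a] for a in pinned["buckets"])
new_view = b"".join(store[a] for a in ref["manifest"]["buckets"])
```

Both views return the same records; only the number of objects fetched differs.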
5. Rerank VectorStorage consolidation
For modalities with rerank=true, each SpatialBucketEntry carries a rerank_storage_hash pointing at a parallel VectorStorage Object holding raw f32 vectors mirroring the bucket's record order (per 0010 §8). When compacting fragments:
- Fetch all per-fragment VectorStorage Objects.
- Union records in the same order as the consolidated bucket's records.
- Encode + PUT one consolidated VectorStorage Object.
- The replacement SpatialBucketEntry.rerank_storage_hash MUST point at the consolidated VS.
v1 implementations MAY skip rerank-VS consolidation (carry over the first fragment's hash) at the cost of read-time rerank inaccuracy on consolidated cells. Operators SHOULD prefer full consolidation when rerank is enabled.
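Full consolidation amounts to re-ordering the raw vectors so that they mirror the consolidated bucket. The sketch below assumes each fragment is available as a pair of parallel lists (its bucket's anchors and its VectorStorage rows) and that the bucket merge has already rejected conflicting anchors; `consolidate_rerank` is a hypothetical helper name.

```python
def consolidate_rerank(fragments, consolidated_anchors):
    """Return vectors ordered to mirror the consolidated bucket's records.

    fragments: list of (anchors, vectors) pairs, where vectors[i] is the raw
    vector for the record at anchors[i] in that fragment.
    """
    by_anchor = {}
    for anchors, vectors in fragments:
        if len(anchors) != len(vectors):
            raise ValueError("VectorStorage does not mirror its bucket")
        for anchor, vector in zip(anchors, vectors):
            by_anchor.setdefault(anchor, vector)  # duplicates keep the first copy
    return [by_anchor[anchor] for anchor in consolidated_anchors]


fragments = [
    ([5, 1], [[0.5], [0.1]]),
    ([9, 5], [[0.9], [0.5]]),
]
rows = consolidate_rerank(fragments, [1, 5, 9])
```

Carrying over only the first fragment's hash would leave anchor 9 without a matching row in this example, which is the read-time inaccuracy described above.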
6. Sharded compaction (optional)
A conformant compactor MAY split work across multiple workers for billion-scale compaction:
- Worker phase (--shard N --of M): each worker handles cells where hash(spatial_key) % M == N. Each worker writes a per-shard JSON output listing its compacted entries.
- Orchestrator phase (--orchestrate --job-id ID): one orchestrator reads all M shard outputs, builds the new Track + Manifest from the union, CAS-advances the Ref.
Worker outputs MUST be content-addressed (idempotent re-runs are safe). Failed workers MUST be re-runnable without corrupting other workers' state.
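Shard assignment only works if every worker computes the same hash for a given key. The sketch below uses SHA-256 as an assumed stable hash (this document does not name one); Python's built-in `hash()` would not do for string keys, since it is salted per process by default. `shard_of` and `cells_for_worker` are hypothetical helper names.

```python
import hashlib


def shard_of(spatial_key: str, m: int) -> int:
    """Map a spatial_key to a shard in [0, m) using a process-independent hash."""
    digest = hashlib.sha256(spatial_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def cells_for_worker(spatial_keys, n, m):
    """Cells handled by worker n of m: those with hash(spatial_key) % m == n."""
    return [key for key in spatial_keys if shard_of(key, m) == n]


keys = [f"cell-{i}" for i in range(100)]
shards = [cells_for_worker(keys, n, 4) for n in range(4)]
```

Because the assignment is a pure function of the key, a failed worker can be re-run with the same N and M and will pick up exactly the same cells.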
7. Idempotence
A compactor running on an already-consolidated dataset (all F ≤ threshold) MUST be a no-op:
- No new Bucket Objects, no new Track Object, no new Manifest.
- The Ref MUST remain at its current state.
- Exit code MUST indicate success (no-op is success, not failure).
This allows operators to run compaction defensively — e.g., before a snapshot, after a rebuild — without worrying about wasted work.
8. Conformance test vectors (deferred to 0009)
The conformance suite at 0009 §11 (added in this revision) carries:
- Multi-bucket reads — append N batches into k cells, query, verify top-K matches brute force.
- Compact idempotence — compact a consolidated dataset, assert no-op.
- Compact correctness — compact a fragmented dataset, assert queries return the same anchors as pre-compact.
- Lineage refusal — compact across an SI hash change, assert compaction fails with the documented error.
- Anchor conflict refusal — compact two fragments with the same time_anchor but different vectors, assert compaction fails with the documented error.
- Read-online property — start a query against the OLD Manifest while compaction runs, assert it completes correctly.
9. Open questions
| OQ | Description |
|---|---|
| OQ-44 | Should compaction support a --rebuild-rerank flag for rerank-VS consolidation when the v1 carry-over-first-hash behavior is in use? |
| OQ-45 | Should sharded compaction emit a single content-addressed orchestration manifest that workers register into, vs. per-shard JSON outputs? Decided per ada-ivf-step precedent (per-shard JSON, simpler) but worth revisiting at billion scale. |
| OQ-46 | Should the SDK expose a Dataset::observe_fragments() API that reports per-cell F for operator monitoring? Useful for "alert when F > 50" pipelines; pure observability, no mutation. |
Resolutions for OQ-44/-45/-46 land in a follow-up revision once empirical operator feedback exists.