Known Flaws — A Design Retrospective

As of 2026-05-15, after the Ada-IVF + auto-rebuild iteration.

Update 2026-05-15 (later that day): most of the architectural flaws listed below have been resolved or have a concrete shipping path. The "Status" line at the top of each flaw tracks the current state:

✅ Resolved: shipped, tests green, live data verified.
🟡 Partially resolved: foundation shipped, follow-up optimizations queued.
🔵 Unblocked: blocker removed, implementation now bounded.
⏳ Outstanding: still flawed as described.

Updates are inline below; the original analysis is preserved so the chain of reasoning stays visible.

This document catalogs DreamDB's current limitations by reconstructing the chain of decisions that produced each one. The pattern is consistent enough to name explicitly: we keep solving the local symptom of a problem by adding a layer that introduces its own symptom one level down. Most flaws are not isolated bugs — they are the next layer's surface of a deeper unresolved tension.

Each section follows the same shape:

Problem — what we were trying to solve.
What we built — the mechanism we shipped.
New issue surfaced — the cost or limitation we now have to live with.
Root cause — the deeper architectural constraint that produced the symptom in the first place.

The closing section maps these to the architectural tensions they share.

1. Auto-rebuild is additive-only and unbounded

Status: ✅ Resolved (2026-05-15). Two fixes:

Inline auto-rebuild was deleted entirely (Phase 1 of design/0003). Maintenance is now operator-driven via dreamdb-cli.
ada-ivf-step --merge-threshold added (Phase 2.2). The CLI now merges underpopulated cells alongside splitting hot ones, structurally bounding k growth. Live evidence: 231K imagenet-100 dataset, k=27,248 → 24,526 in one pass, the first observed rebuild that shrank k.

Problem

Operator-driven dreamdb-cli rebuild-ivf is the principled way to keep an IVF index healthy as data shifts, but it requires the operator to notice imbalance, schedule a job, and wait for completion. Small-scale users wanted "the index just stays healthy" without that overhead.

What we built

Schema.add_embedding(auto_rebuild=True, max_n=10M, threshold=1.5). On every append_many, after the per-cell merge-on-write loop builds the new combined bucket entries, we compute per-cell counts, derive the coefficient-of-variation imbalance score, and if it crosses threshold AND total_n ≤ max_n we run a localized re-cluster inline via Dataset::ada_ivf_step_inline. The same Manifest publish covers both the appended records and the rebuilt SI atomically.
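The gate described above can be sketched as follows. This is a minimal illustration, not the shipped code: the function names are hypothetical, the coefficient of variation is assumed to use the population standard deviation, and "crosses threshold" is assumed to mean strictly greater than.

```rust
/// Coefficient of variation (population std-dev / mean) of per-cell counts.
/// Returns 0.0 for an empty or all-zero distribution.
fn imbalance_score(counts: &[u64]) -> f64 {
    if counts.is_empty() {
        return 0.0;
    }
    let n = counts.len() as f64;
    let mean = counts.iter().sum::<u64>() as f64 / n;
    if mean == 0.0 {
        return 0.0;
    }
    let variance = counts
        .iter()
        .map(|&c| {
            let d = c as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    variance.sqrt() / mean
}

/// Fire only when the score crosses `threshold` AND total_n <= max_n.
fn should_auto_rebuild(counts: &[u64], threshold: f64, max_n: u64) -> bool {
    let total_n: u64 = counts.iter().sum();
    imbalance_score(counts) > threshold && total_n <= max_n
}
```

A perfectly balanced index scores 0.0, and a single cell holding every record scores highest, so the threshold of 1.5 only trips on strongly skewed distributions.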

New issue surfaced

Auto-rebuild only splits cells; it never merges them. Each fire grows k by ~30% (n_splits ≈ 2–8 per hot cell). On the imagenet-100 recreate this produced a measured climb of:

k = 447 → 593 → 701 → 1065 → 1559 → 2017 → 2680 → 3654 → 4948 → 6708
     → 8949 → 12340 → 16733

At each step the per-batch ingest cost grew with k (see flaw §4) and average throughput fell from 280/s to 52/s over the course of one ingest. The "self-healing" mechanism hurt the workload it was supposed to help. It also kept firing in an infinite-extension pattern: density hits the gate → split → next density gate is higher → wait for more data → split again. No equilibrium exists because the splits operate only in one direction.

Root cause

The dreamdb_protocol::ada_ivf module shipped local_split, update_centroids, and find_overpopulated_partitions but never a merge_partitions primitive. The find_underpopulated_partitions function exists but is unused.
The Mohoney et al. Ada-IVF paper (arXiv 2411.00970 §4) explicitly prescribes BOTH splits AND merges with a global k-cap. We implemented half the algorithm.

What "right" would look like

Add a merge primitive that combines an underpopulated cell with its nearest neighbour. Requires composing update_centroids to support merge-replacement in addition to drop-and-append.
Add a hard cap: k_max = 2 · √N. When auto-rebuild would push k past this, it must MERGE before splitting, or refuse to fire and emit an "operator must rebuild-ivf" warning.
Default auto_rebuild=False and document it as a small-scale convenience, not the production maintenance path.
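A rough sketch of the cap logic proposed above, under stated assumptions: the enum, the function names, and the idea that each merge removes exactly one cell are illustrative choices, not part of the protocol.

```rust
#[derive(Debug, PartialEq)]
enum RebuildAction {
    /// Splitting alone stays under the cap.
    Split,
    /// Merging underpopulated cells first makes room for the splits.
    MergeThenSplit,
    /// Even after merging, k would exceed the cap: warn the operator instead.
    Refuse,
}

/// Hard cap k_max = 2 * sqrt(N), rounded down.
fn k_max(n_records: u64) -> u64 {
    (2.0 * (n_records as f64).sqrt()).floor() as u64
}

/// `new_cells` is the net number of cells the pending splits would add;
/// `mergeable` is how many cells merging could remove (assumed one per merge).
fn plan_rebuild(k: u64, new_cells: u64, mergeable: u64, n_records: u64) -> RebuildAction {
    let cap = k_max(n_records);
    let after_split = k + new_cells;
    if after_split <= cap {
        RebuildAction::Split
    } else if after_split.saturating_sub(mergeable) <= cap {
        RebuildAction::MergeThenSplit
    } else {
        RebuildAction::Refuse
    }
}
```

With N = 10,000 the cap is 200, so a rebuild that would take k from 190 to 220 is only allowed if at least 20 cells can be merged away first.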

2. Every SI change forces O(N) record re-dispatch

Status: 🟡 Partially resolved (2026-05-15). Chain-aware lineage shipped (Phase 3.1) — SpatialIndexObject carries parents: Vec<Multihash> per spec/0004 §3.5, bucket lineage check walks the chain up to 100 ancestors. Cold-bucket skip in rebuild_all_buckets (Phase 3.2) uses the same id-map as update_centroids to identify preserved cells; those cells skip decode + redispatch entirely. Live: 35% of cells preserved on the first imagenet-100 split-only test. Outstanding: redispatch within shifted-position cells still re-PUTs identical bytes under the new spatial_key path; an address-scheme change (move spatial_key off the path, into the Track entry only) would eliminate the re-PUT — deferred to a future iteration.

Problem

After Ada-IVF or full-rebuild produces a new SI, the SDK has to ensure no record is queried against a centroid set it wasn't placed under — otherwise queries would mis-route and recall would collapse silently.

What we built

Bucket-header lineage check (spec/0007 §6.1.2). Every SpatialBucket Object's header carries the 33-byte spatial_index_hash of the SI it was placed under. Dataset::append_many reads each prior bucket's header and refuses to merge if the hash differs from the current schema's SI hash. The error suggests "re-ingest from scratch into a fresh Ref."

New issue surfaced

Any centroid change requires rewriting every bucket in the dataset. Even when only one hot cell needs splitting (touching, say, 0.1% of records), the other 99.9% must still go through GET + decode + re-bucket + PUT just so their bucket headers carry the new SI hash. We measured this cost on the imagenet-100 recreate: each Ada-IVF step re-dispatched all 131K records (~30 s), and at 10B records it would take hours.

Worse: the re-dispatch goes through vc.decode(...), which for RaBitQ produces an approximate f32 reconstruction. Every rebuild compounds quantization error — records rebucketed many times drift further from their "true" centroid.

Root cause

Lineage was modelled as a strict equality check, not as a chain. The bucket header carries one hash and there's no notion of "this SI descends from that older SI". So an updated SI is always treated as incompatible with all prior buckets.
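The difference between the strict check and a chain-aware one can be sketched like this. It is an illustration only: hashes are plain strings standing in for 33-byte multihashes, the `parents` map stands in for fetching SI Objects, and the ancestor budget mirrors the 100-ancestor limit mentioned in the status line.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Strict equality: any SI update invalidates every prior bucket.
fn strict_check(bucket_si: &str, current_si: &str) -> bool {
    bucket_si == current_si
}

/// Chain-aware: the bucket is acceptable if its SI is the current SI or one of
/// the current SI's ancestors, visiting at most `max_ancestors` ancestors.
fn chain_check(
    bucket_si: &str,
    current_si: &str,
    parents: &HashMap<String, Vec<String>>,
    max_ancestors: usize,
) -> bool {
    let mut queue: VecDeque<&str> = VecDeque::new();
    let mut seen: HashSet<&str> = HashSet::new();
    queue.push_back(current_si);
    seen.insert(current_si);
    let mut budget = max_ancestors;
    while let Some(h) = queue.pop_front() {
        if h == bucket_si {
            return true;
        }
        if let Some(ps) = parents.get(h) {
            for p in ps {
                if budget == 0 {
                    break;
                }
                // Only enqueue ancestors we have not seen yet.
                if seen.insert(p.as_str()) {
                    queue.push_back(p.as_str());
                    budget -= 1;
                }
            }
        }
    }
    false
}
```

Passing the chain check only says the bucket's SI is an ancestor; whether the bucket's own cell survived unchanged is a separate question.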

What "right" would look like

Bucket header carries spatial_index_lineage: Vec<Multihash> (the chain of SI ancestors). Lineage check passes if the SI hash recorded in the bucket header is the current SI's hash or appears in the current SI's chain of ancestors.
SI Object gains a parents: Vec<Multihash> field — the SIs it evolves from.
For cells whose centroid was UNCHANGED across the SI update, the bucket needs no rewrite. For cells whose centroid was REPLACED, records must still be re-dispatched (correctness) — but only those records, not the whole dataset.

This is a spec-level change. It probably also requires the SI to record which centroid indices changed between ancestor and self, so SDKs can mechanically compute "is bucket X still valid".
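If the SI did record which centroid indices changed, the "is bucket X still valid" computation could be as small as the sketch below. The function name and the use of `u32` cell ids are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Splits populated cell ids into (preserved, must_redispatch) given the set
/// of centroid indices that changed between the ancestor SI and the new SI.
fn partition_cells(populated: &[u32], changed: &HashSet<u32>) -> (Vec<u32>, Vec<u32>) {
    populated
        .iter()
        .copied()
        .partition(|cell| !changed.contains(cell))
}
```

Only the second list needs decode and re-dispatch; buckets for the first list can stay on disk untouched.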

3. Inline auto-rebuild blocks the writer

Status: ✅ Resolved (2026-05-15). Deleted entirely (Phase 1). The "inline" framing was wrong — maintenance is now async via operator-scheduled CLI. The throughput collapse from 280/s → 52/s observed during the recreate run was caused by this; without inline rebuilds, ingest throughput returns to peak rates (limited only by per-batch IvfCosine.hash_vector cost at high k, which is a separate concern bounded by Phase 3.2).

Problem

We wanted the append path to fix imbalance on its own. Async background maintenance would require a daemon, which conflicts with DreamDB's no-daemon design (cron / k8s CronJob / GitHub Actions instead).

What we built

Dataset::ada_ivf_step_inline runs inside append_many between the bucket-consolidation loop and the Manifest publish. If imbalance crosses threshold AND N ≤ max_n, it fires synchronously: GET every bucket, decode, redispatch, PUT new buckets, publish new SI, then continue with the normal Manifest publish.

New issue surfaced

The writer stalls for the full duration of the rebuild. At 131K records this was ~30 seconds; at 1M records it'd be ~5 minutes; at 10M (our self-imposed max_n) it'd be ~50 minutes. The max_n knob was supposed to bound this — but a 50-minute "automatic" stall is not what users expect from a streaming append API. Above max_n we fall back to an eprintln! warning, which most callers won't see.

Root cause

The fundamental tension: DreamDB's no-daemon stance + immutability + content-addressing means "background work" must come from external schedulers. There's no in-protocol way to defer work without somebody running the deferred job. Inline was the only way to keep auto_rebuild self-contained — but inline means synchronous.

What "right" would look like

Two changes:
1. The "imbalance check at append time" is fine (cheap). It should emit a Manifest-registry signal (dreamdb.recommendations) that external monitoring picks up and triggers ada-ivf-step async.
2. The auto-rebuild firing should be ripped out. Replace it with "tell the operator to schedule a rebuild" — even at small scale.
Or accept that some users will want "automatic" and provide a separate dreamdb-cli watch command that polls Manifests and runs ada-ivf-step when recommendations land. That's a daemon, but it's an OPT-IN one external to the protocol.

4. Per-batch cost scales O(k·dim)

Status: 🟡 Partially resolved (2026-05-15). The merge step from flaw §1's fix now bounds k growth, addressing the root cause. The intrinsic O(k·dim) cost remains (it's how IVF works), but k now stays near √N for healthy workloads. Future: parallelize hash_vector via rayon (~30 LOC, would 5-8× the dispatch throughput on multi-core). IMI partitioning (already in protocol, not used in production datasets) would also help at extreme k.

Problem

IVF dispatch needs to compute, for each new record, which of the k centroids it's closest to. This is the natural mechanic.

What we built

IvfCosine::hash_vector does k dot products of dim-d vectors per record. Single-threaded f32 left-fold per spec/0004 §5.4.
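As a rough picture of that mechanic, here is a minimal sketch of a k-way nearest-centroid lookup using a single-threaded f32 left fold. The signature is illustrative and is not the actual IvfCosine::hash_vector API; it assumes unit-norm vectors (so the dot product ranks like cosine similarity) and breaks ties toward the lowest index.

```rust
/// Dot product accumulated left to right in f32, in index order.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0f32, |acc, (x, y)| acc + x * y)
}

/// Index of the centroid with the largest dot product against `v`.
/// Costs k dot products of `dim` multiplies each, hence O(k * dim) per record.
fn hash_vector(centroids: &[Vec<f32>], v: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, c) in centroids.iter().enumerate() {
        let score = dot(c, v);
        if best.map_or(true, |(_, b)| score > b) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}
```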

New issue surfaced

At inflated k (say 16,733 at dim 512), per-record hash cost is ~3 ms. Per 256-sample batch that's ~770 ms — just for dispatch. Combined with merge-on-write HTTP (~2 s/batch at k=16K) and CLIP encode (~100 ms), the per-batch cost climbs to ~3 s and throughput drops to ~85 samples/s. We observed this during the imagenet-100 recreate.

Root cause

Linear-in-k cost is intrinsic to flat IVF. The reason it hurts is flaw §1: auto-rebuild inflated k far past √N. With a properly-sized k (≈ 362 for 131K records), the dispatch cost, being linear in k, would be roughly 46× cheaper (16,733 / 362) and the ingest would run at 500-2000 samples/s.

What "right" would look like

Fix flaw §1 (cap k growth).
Optionally adopt IMI (Inverted Multi-Index) for the partitioning, which factorizes the k-way centroid lookup into two √k-way lookups, one per half of the vector. Spec already defines dreamdb.imi-cosine for this purpose; we're not using it in production datasets.
Multi-thread the dispatch via rayon: each batch of 256 records can be hashed in parallel, dropping the ~770 ms to ~100 ms on an 8-core machine.
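The parallel-dispatch idea can be sketched with the standard library alone. Note the swap: this uses std::thread::scope with fixed chunks rather than rayon, purely so the example has no external dependencies; the function names are hypothetical and the centroid set is assumed non-empty.

```rust
use std::thread;

/// Nearest centroid by dot product (assumes `centroids` is non-empty).
fn nearest(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = (0usize, f32::NEG_INFINITY);
    for (i, c) in centroids.iter().enumerate() {
        let score: f32 = c.iter().zip(v).map(|(x, y)| x * y).sum();
        if score > best.1 {
            best = (i, score);
        }
    }
    best.0
}

/// Hashes a batch across up to `n_threads` scoped threads.
/// Each thread owns a disjoint slice of the output, so no locking is needed.
fn dispatch_parallel(centroids: &[Vec<f32>], batch: &[Vec<f32>], n_threads: usize) -> Vec<usize> {
    let n_threads = n_threads.max(1);
    let chunk = ((batch.len() + n_threads - 1) / n_threads).max(1);
    let mut out = vec![0usize; batch.len()];
    thread::scope(|s| {
        for (records, slots) in batch.chunks(chunk).zip(out.chunks_mut(chunk)) {
            s.spawn(move || {
                for (r, slot) in records.iter().zip(slots.iter_mut()) {
                    *slot = nearest(centroids, r);
                }
            });
        }
    });
    out
}
```

Because each record is hashed independently, the result matches the sequential order exactly; only wall-clock time changes.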

5. Sharded ada-ivf-step is half a solution

Status: 🔵 Unblocked (2026-05-15). Chain-aware lineage + cold-bucket skip mean the redispatch step is now O(touched cells), not O(N). The orchestrator's record-redispatch in rebuild_all_buckets was also parallelized via buffer_unordered(16) for the fetch step. Implementing true sharded redispatch (workers handle their slice's replaced cells in parallel, orchestrator stitches paged Track leaves) is now bounded code, not architectural redesign. Deferred to Phase 3.3.

Problem

Single-machine ada-ivf-step on 10B records would take ~3 hours. We wanted a path that scales horizontally across k8s pods.

What we built

Two-stage sharded mode:

Workers (--shard N --of M --job-id X): each worker claims hot cells where cell_id % M == N, decodes their records, runs local_split, writes shard JSON at <bucket>/_ada_ivf/<job-id>/centroids/shard-NNNN.json.
Orchestrator (--orchestrate --job-id X): reads all shard JSONs, aggregates centroid replacements, publishes new SI, then re-dispatches EVERY record (single-machine), publishes Manifest, CAS the Ref.
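The worker-side claim rule is simple enough to sketch directly. The helper names are hypothetical, the key is shown relative to the bucket root, and the four-digit zero padding is an assumption read off the shard-NNNN.json pattern.

```rust
/// A worker with `--shard n --of m` claims a hot cell when cell_id % m == n.
fn claims(cell_id: u64, shard: u64, of: u64) -> bool {
    cell_id % of == shard
}

/// Hot cells this worker is responsible for.
fn cells_for_shard(hot_cells: &[u64], shard: u64, of: u64) -> Vec<u64> {
    hot_cells
        .iter()
        .copied()
        .filter(|&c| claims(c, shard, of))
        .collect()
}

/// Object key (relative to the bucket) where the worker writes its shard JSON.
fn shard_key(job_id: &str, shard: u64) -> String {
    format!("_ada_ivf/{job_id}/centroids/shard-{shard:04}.json")
}
```

The modulo rule guarantees every hot cell is claimed by exactly one worker without any coordination, which is why only the later stitching step needs the orchestrator.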

New issue surfaced

Only the centroid-computation step is parallelized. The expensive step — record re-dispatch — runs serially on the orchestrator. At 10B records, decode + hash_vector + bucket re-PUT is the bottleneck regardless of how many workers we have. The sharded mode's wall-clock improvement is ~30% (the local_split portion), not 100× as the parallel-workers naming implies.

A second stage of sharding could distribute the redispatch (each worker handles records in cells it owns), but that adds a second synchronization barrier (workers need to know new SI before dispatching → orchestrator publishes SI → second worker pass dispatches → second orchestrator pass finalizes Manifest). Four sequential k8s Jobs.

Root cause

The orchestrator HAS to single-thread the re-dispatch because the final Track is one CBOR Object. Parallel workers can each produce sub-buckets, but assembling them into a single Track is serial. To truly distribute, the Track would need to be paged (B-tree of leaf pages), and our ada-ivf-step doesn't yet support paged tracks (flaw §6).

What "right" would look like

Phase-3 sharded redispatch: each worker handles records in cells it owns, after orchestrator publishes new SI. Each worker emits a paged-track LEAF Object. Final orchestrator stage assembles leaves into a B-tree.
Paged-Track support in update_centroids and bucket re-dispatch paths. Currently rejected outright (flaw §6).

6. Paged tracks aren't supported by rebuild verbs

Status: ✅ Resolved (2026-05-15, Phase 3.4). ada-ivf-step now READS paged TrackObjects via B-tree walk and WRITES paged TrackObjects via bottom-up B-tree build (leaf=1000 entries, fanout=100). Inline-vs-paged decision auto-fires at 8000-entry threshold (~960 KB). Combined with chain-aware lineage + cold-bucket skip, rebuilds at 1B-cell scale are now O(touched cells) for the bucket pass plus O(N) for the Track B-tree rebuild — the latter is the natural Phase 3.4b target (incremental B-tree update for cells whose entries didn't change).
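To get a feel for the shape these parameters produce, the following sketch counts pages per level for a bottom-up build. It is a back-of-the-envelope model, not the shipped builder: the function names are made up and the inline-vs-paged switch is assumed to trigger at exactly the threshold.

```rust
fn div_ceil(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}

/// Pages per level of a bottom-up B-tree build, leaves first, root last.
fn paged_track_levels(n_entries: u64, leaf_cap: u64, fanout: u64) -> Vec<u64> {
    assert!(leaf_cap >= 1 && fanout >= 2, "degenerate tree parameters");
    let mut levels = vec![div_ceil(n_entries.max(1), leaf_cap)];
    while *levels.last().unwrap() > 1 {
        let next = div_ceil(*levels.last().unwrap(), fanout);
        levels.push(next);
    }
    levels
}

/// Inline array below the threshold, paged B-tree at or above it (assumed).
fn use_paged(n_entries: u64, threshold: u64) -> bool {
    n_entries >= threshold
}
```

With leaf=1000 and fanout=100, a Track of about 24.5K entries needs 25 leaves under a single root, and one million entries still fit in three levels.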

Problem

At ~10K inline track entries the Manifest's inline-array form crosses 1 MiB and per spec/0002 §7.2.2 the track switches to a paged B-tree. This is necessary at 1B-record scale.

What we built

Dataset::append_many handles paged tracks (B-tree maintenance is in dreamdb-protocol). But dreamdb-cli ada-ivf-step and the inline auto-rebuild path BOTH bail out with "paged tracks not yet supported" when they encounter one.

New issue surfaced

Maintenance verbs become unavailable at exactly the scale where you most need them. A 1B-record dataset has tens of thousands of populated cells → its track is paged → ada-ivf-step refuses to run → the only path is rebuild-ivf from scratch.

Root cause

Paged-track read/write requires walking a B-tree of Index Pages, splitting/merging on insert, etc. Substantial code that the rebuild verbs need but didn't get because the simpler inline path was enough for our test scale.

What "right" would look like

Implement paged-track support in ada_ivf_step.rs and ada_ivf_step_inline. Adds maybe ~200 LOC. Without it, the entire maintenance story is "works at 1M records, broken at 1B" — the inverse of what DreamDB claims.

7. ada-ivf-status lies about imbalance

Status: ✅ Resolved (2026-05-15, Phase 2.1). ada-ivf-status now reads the current Manifest's Track instead of LIST-PREFIX. Output adds underpopulated_partitions so operators can size the merge step. Verified on imagenet-100: pre-fix reported 100K buckets across 42K populated partitions (counting historical orphans); post-fix reports the accurate 24K buckets across 21K populated partitions.

Problem

Operators want a cheap way to check "should I run ada-ivf-step?" without fetching the current Manifest + Track.

What we built

dreamdb-cli ada-ivf-status walks the LIST-PREFIX of the modality's spatial-key space (<timeline>/<modality>/), counts every bucket Object it sees, computes per-cell counts and imbalance.

New issue surfaced

List-prefix sees every historical bucket, not just live ones. After the imagenet-100 recreate the current Track had 16,702 live buckets but ada-ivf-status reported 100,347 buckets across 42,575 partitions. The imbalance score it produced (1.21) was computed over a fictional distribution that mixed live + dead buckets.

After the manual GC pass (87K orphans deleted) the status was correct, but only because GC happened to run. Without GC, ada-ivf-status is permanently wrong on any dataset that has ever been rebuilt.

Root cause

We chose the "cheap" implementation (list-prefix, no Manifest fetch) over the correct one (resolve Ref → Manifest → Track → walk entries). The cheap path looks right on a fresh dataset but accumulates lies over the dataset's lifetime.

What "right" would look like

Replace the list-prefix scan with a Track walk. Costs one extra HTTP GET (the Manifest) plus one per Track Object. Trivial. Should have been the original implementation. The current ada_ivf_step code in single-machine mode already walks the Track — that's the right code, just split out into its own verb.
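The gap between the two approaches comes down to a set difference, sketched below with plain string keys. The function name is hypothetical; `listed` stands for what LIST-PREFIX returns and `track` for the bucket keys reachable from the current Manifest's Track.

```rust
use std::collections::HashSet;

/// Returns (live bucket count, orphan count).
/// Only keys present in the Track are live; anything else the listing
/// returns is a historical orphan that must not feed the imbalance score.
fn live_and_orphans(listed: &[&str], track: &[&str]) -> (usize, usize) {
    let live: HashSet<&str> = track.iter().copied().collect();
    let orphans = listed.iter().filter(|k| !live.contains(**k)).count();
    (live.len(), orphans)
}
```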

8. GC requires manual scripting

Status: ✅ Resolved (2026-05-15, Phase 2.4). dreamdb-cli gc is a first-class verb with --keep-manifests N, --keep-since DURATION, --dry-run. Parallelizes HEAD requests via buffer_unordered(32) and DELETE requests too. Verified: deleted 22,501 orphans from imagenet-100, dataset still resolves correctly. Outstanding nuance: walks parents[0] only, so multi-parent merge histories aren't fully preserved (out of scope until Phase 3.5 ships multi-parent merge).

Problem

DreamDB is immutable and append-only. Every rebuild, every append, every schema migration produces NEW Objects without removing OLD ones. Over time the bucket fills with orphans.

What we built

A spec definition (spec/0006 §7.3) of mark-and-sweep GC with a 24h Last-Modified threshold — and nothing else. The sample script we wrote this session (/tmp/dreamdb_gc.py, 175 lines of Python with boto3) is the entire implementation.

New issue surfaced

Operators have no path to bounded storage. Without GC, every rebuild leaves the previous generation of bucket Objects behind. On a 1-year-old dataset with daily rebuilds you'd have up to 365× more bucket Objects on disk than are reachable from the current Manifest. List operations slow down linearly in the number of historical Objects.

The manual script we wrote walks only the CURRENT Manifest — meaning running it destroys all time-travel snapshots. So the operator has to choose between "infinite storage growth" and "no time-travel". That's not a real choice.

Root cause

GC is on the spec roadmap (spec/0006 §7.3) but isn't implemented and there's no dreamdb gc CLI verb. The spec defines the discipline but not the retention policy: "what manifests should be preserved" is left to the operator.

What "right" would look like

dreamdb gc --keep-manifests=N — keep the N most recent Manifests reachable from each Ref, GC everything else. Default N=100 so daily rebuilds get 100 days of time-travel.
dreamdb gc --keep-since=24h — preserve everything modified within 24 hours (the spec's safety threshold).
Combined: respect both filters; never delete an Object that's still reachable from any preserved Manifest.
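A minimal sketch of how the two filters could combine in a mark-and-sweep pass. Everything here is an in-memory stand-in: the `Manifest` struct, the single-parent walk, and ages expressed in seconds are assumptions for illustration, not the spec's data model.

```rust
use std::collections::{HashMap, HashSet};

struct Manifest {
    parent: Option<String>,
    objects: Vec<String>,
}

/// Returns the keys that are safe to delete, sorted for determinism.
fn gc_plan(
    tip: &str,
    manifests: &HashMap<String, Manifest>,
    object_age_secs: &HashMap<String, u64>,
    keep_manifests: usize,
    keep_since_secs: u64,
) -> Vec<String> {
    // Mark: everything reachable from the N most recent Manifests.
    let mut reachable: HashSet<&str> = HashSet::new();
    let mut cursor: Option<&str> = Some(tip);
    for _ in 0..keep_manifests {
        let Some(id) = cursor else { break };
        let Some(m) = manifests.get(id) else { break };
        reachable.insert(id);
        reachable.extend(m.objects.iter().map(String::as_str));
        cursor = m.parent.as_deref();
    }
    // Sweep: unreachable AND older than the safety threshold.
    let mut doomed: Vec<String> = object_age_secs
        .iter()
        .filter(|(k, age)| !reachable.contains(k.as_str()) && **age > keep_since_secs)
        .map(|(k, _)| k.clone())
        .collect();
    doomed.sort();
    doomed
}
```

An Object is deleted only when both conditions hold, so a freshly written Object from an in-flight append survives even though no preserved Manifest references it yet.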

9. No deletion / tombstones

Status: ✅ Resolved (2026-05-18, B8 in 10B-scale push). spec/0020 defines TombstoneListObject (anchor-keyed, parent-DAG, canonical CBOR). Dataset::delete(&[u64], reason) + Dataset::tombstone_set() + dreamdb delete CLI ship the operator surface. Read paths (iter_with_fields, iter_stream) auto-consult the tombstone set; deleted anchors disappear from queries without rewriting the underlying Track. Storage compaction (reclaim bytes for tombstoned records) is deferred per spec/0020 §6. 9 new tests, 721 total green.

Problem

DreamDB is append-only by design — append, never mutate.

What we built

... nothing. The protocol has no deletion verb. The spec doesn't address it.

New issue surfaced

Records that were ingested wrong (corrupted CLIP embedding, bad parquet row, GDPR-mandated deletion) cannot be removed. The only workaround is to read the current Track, build a new Track that excludes the unwanted records, and publish a new Manifest pointing at it. This is a full data rewrite for one deletion.

Root cause

Append-only is the protocol's defining design choice. It's what unlocks content-addressing, time-travel, and lock-free reads. But it has no story for the "this row shouldn't exist" use case that real production systems hit weekly.

What "right" would look like

dreamdb.tombstones registry entry: per-modality, a list of (track_position, anchor_hash) pairs marking deleted records. Query path skips records whose ordinal+hash matches.
Eventually-compacting: on the next rebuild, the rebuilt Track omits tombstoned records entirely.
GDPR-compliant: the original record's bytes are NOT deleted from the content store (it's content-addressed, possibly shared) but the path from any Ref to those bytes is severed.
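On the read side, the proposal amounts to a filter over Track entries, roughly as sketched here. The `Record` struct and the `u64` stand-in for the anchor hash are hypothetical simplifications.

```rust
use std::collections::HashSet;

#[derive(Debug)]
struct Record {
    position: u64,
    anchor_hash: u64, // stand-in for the real anchor hash
    payload: &'static str,
}

/// Yields only records with no matching (track_position, anchor_hash) tombstone.
/// Both fields must match, so a stale tombstone cannot hide a different record
/// that later occupies the same position.
fn live_records<'a>(
    track: &'a [Record],
    tombstones: &'a HashSet<(u64, u64)>,
) -> impl Iterator<Item = &'a Record> + 'a {
    track
        .iter()
        .filter(move |r| !tombstones.contains(&(r.position, r.anchor_hash)))
}
```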

This needs spec work.

10. Append + rebuild conflict has no recovery path

Status: ✅ Resolved (2026-05-18, B2 in 10B-scale push). Dataset::branch(name), Dataset::merge(other, MergeStrategy::FastForward), and MergeStrategy::UnionTracks (3-way fused-merge with LCA walk + per-cell bucket reconciliation) all ship. Dataset::merge_many(&[branches]) + dreamdb merge-many CLI orchestrate N-way sharded ingest. The protocol-level framing (layered-merge vs fused-merge) is now in spec/0008 §5.3; the algorithm lives in design/0007-sharded-ingest.md. Outstanding: Fragment/Scalar/Constant union-merge (currently the algorithm refuses non-embedding diverged tracks — v0.1 extension).

Problem

Concurrent append and rebuild on the same Ref would race the SI swap: the appender places records under the old SI, the rebuilder publishes a new SI, and whichever writer lands its Ref CAS last leaves records that are silently mis-routed.

What we built

SpatialIndex conflict = MUST-REFUSE merge (per project_collab_disciplines.md). Both writers do CAS; one wins; the other gets CasFailed and bails out. Documentation says "use a feature branch".

New issue surfaced

Dataset.branch() is not implemented. The advice to "use a feature branch" has no API. Operators who hit the conflict have nothing they can do programmatically — they have to manually create a new Ref via the connector and figure out coordination themselves.

Root cause

Refs are 33-byte content pointers under <bucket>/refs/<name>. Branching should just be "create a new Ref pointing at the current Manifest": one PUT. The mechanics are trivial; the API surface and merge story aren't there.

What "right" would look like

Dataset::branch(new_ref_name) -> Result<Dataset> — PUT a new Ref at current Manifest hash; return a Dataset bound to it.
Dataset::merge(other_ref, strategy: MergeStrategy) -> Result<()> — build a new Manifest with parents=[self.tip, other.tip]. Strategy options: refuse on SI conflict, fast-forward only, etc.
These belong in dreamdb-dataset/src/dataset.rs next to create / open.
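To show how little machinery the branch step needs, here is an in-memory sketch of a Ref store with branch and compare-and-swap. `RefStore`, `RefError`, and the string-valued hashes are hypothetical stand-ins; the real connector would issue conditional PUTs against object storage.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum RefError {
    NotFound,
    AlreadyExists,
    CasFailed,
}

#[derive(Default)]
struct RefStore {
    refs: HashMap<String, String>, // ref name -> Manifest hash
}

impl RefStore {
    /// Branching is one write: a new Ref pointing at the source Ref's tip.
    fn branch(&mut self, from: &str, new_name: &str) -> Result<(), RefError> {
        let tip = self.refs.get(from).cloned().ok_or(RefError::NotFound)?;
        if self.refs.contains_key(new_name) {
            return Err(RefError::AlreadyExists);
        }
        self.refs.insert(new_name.to_string(), tip);
        Ok(())
    }

    /// Advance a Ref only if it still points at `expected`.
    fn cas(&mut self, name: &str, expected: &str, new: &str) -> Result<(), RefError> {
        match self.refs.get_mut(name) {
            None => Err(RefError::NotFound),
            Some(cur) if cur.as_str() != expected => Err(RefError::CasFailed),
            Some(cur) => {
                *cur = new.to_string();
                Ok(())
            }
        }
    }
}
```

A rebuilder working on its own branch can no longer lose a CAS race against the appender on the main Ref; the conflict moves to the merge step, where it can be handled explicitly.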

11. Decode-on-rebuild compounds quantization error

Status: 🟡 Partially mitigated (2026-05-15). Cold-bucket skip (Phase 3.2) means cells whose centroid is preserved never get re-decoded — those records sit on disk indefinitely with the same compressed codes. Only the replaced cells' records pass through vc.decode → hash_vector → re-encode. So the compounding only hits records in cells that actually changed, not every record on every rebuild. Materially better but not perfect. Real fix is rerank=True schemas (raw f32 stored alongside codes) — already shipped, just not used on the imagenet-100 demo.

Problem

The rebuild verbs (ada-ivf-step and the inline auto-rebuild) need to re-dispatch records — meaning they need each record's f32 vector to compute its new centroid id.

What we built

For each record we call vc.decode(codes) to recover an approximate f32 vector via inverse rotation of the RaBitQ codes. That's then fed into IvfCosine::hash_vector(...) to get the new spatial key.

New issue surfaced

1-bit RaBitQ decode is approximate (per-dim is ±scale, not the original f32 value). Records that are rebucketed many times drift further from where they "should" be — each rebuild's decoded f32 is already lossy, so the new spatial_key it's assigned is computed from a worse approximation than the previous one.

Empirically this was OK at our scale (recall stayed >95% after 10+ rebuilds), but it is a real failure mode for long-lived, heavily-rebuilt datasets where the same records sit in the same cell across many rebuild cycles.

Root cause

We store ONLY the compressed codes; the original f32 is gone after encoding. Recovering f32 from 1-bit codes is fundamentally lossy.
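A toy example makes the loss concrete. This is a deliberately simplified 1-bit scheme, not RaBitQ itself: it omits the rotation, and using the mean absolute value as the scale is an assumption. It only illustrates that a sign-plus-scale reconstruction can land nearer a different centroid than the original vector.

```rust
/// Keep one sign bit per dimension plus a single scale (assumes non-empty `v`).
fn encode_1bit(v: &[f32]) -> (Vec<bool>, f32) {
    let scale = v.iter().map(|x| x.abs()).sum::<f32>() / v.len() as f32;
    (v.iter().map(|&x| x >= 0.0).collect(), scale)
}

/// Every dimension decodes to +scale or -scale; magnitudes are gone.
fn decode_1bit(bits: &[bool], scale: f32) -> Vec<f32> {
    bits.iter().map(|&b| if b { scale } else { -scale }).collect()
}

/// Nearest centroid by dot product (assumes non-empty `centroids`).
fn nearest(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = (0usize, f32::NEG_INFINITY);
    for (i, c) in centroids.iter().enumerate() {
        let score: f32 = c.iter().zip(v).map(|(x, y)| x * y).sum();
        if score > best.1 {
            best = (i, score);
        }
    }
    best.0
}
```

For v = [0.9, 0.1] and centroids [1, 0] and [0.6, 0.8], the original vector belongs to the first centroid, while its decoded form [0.5, 0.5] scores higher against the second.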

What "right" would look like

The two-pass-rerank schema (rerank=True) keeps raw f32 alongside compressed codes in a parallel VectorStorage. Rebuilds on rerank-on datasets can use the RAW vectors for re-dispatch — exact, not approximate.
Without rerank, the only fix is to NOT decode + redispatch, i.e. abandon Ada-IVF for datasets that store no raw vectors and only support rebuild-ivf (a full retrain that uses fresh data, not decoded data).

12. Hidden cost: merge-on-write HTTP round-trips per batch

Status: ⏳ Outstanding (but bounded). Still ~N HTTP GET + N HTTP PUT per batch for the N cells touched. With k now bounded near √N (flaw §1 fix), the average batch touches fewer cells than before; the throughput collapse from k inflation is gone. Future: backend-side compose (MinIO composeObject) or partitioned-bucket super-objects (spec/0007 has the concept but no implementation). Not on the critical path.

Problem

Earlier versions of DreamDB emitted one bucket per ingest batch per spatial key, producing many small buckets per cell (many_batches_consolidate_to_one_bucket_per_cell test caught this). We fixed it.

What we built

Per project_bucket_consolidation.md: each append_many reads every prior bucket for each cell it's about to write, merges with new records, and writes ONE consolidated bucket per cell.

New issue surfaced

Every batch pays N HTTP GET + N HTTP PUT for the N cells it touches. At dense ingest into a high-k dataset, a typical 256-sample batch spreads across ~200 cells. That's ~200 GETs + ~200 PUTs to MinIO, ~10 ms each, = ~4 seconds of HTTP for one batch. Throughput-bound at ~64 samples/s.
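The arithmetic behind those figures, written out as a tiny cost model. The function names are illustrative, and the model assumes requests are issued one after another with a flat per-request latency.

```rust
/// Seconds of HTTP per batch: one GET plus one PUT for every cell touched.
fn http_secs_per_batch(cells_touched: u32, ms_per_request: f64) -> f64 {
    let requests = 2.0 * cells_touched as f64;
    requests * ms_per_request / 1000.0
}

/// Upper bound on samples/s if HTTP were the only cost.
fn throughput(batch_size: u32, secs_per_batch: f64) -> f64 {
    batch_size as f64 / secs_per_batch
}
```

Plugging in 200 cells at 10 ms per request gives 4 s per batch, and 256 samples over 4 s gives the 64 samples/s ceiling quoted above.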

We measured this during the recreate at k=16,733. The HTTP overhead matches IvfCosine::hash_vector cost almost exactly — both scale with k.

Root cause

Each bucket is one Object. We don't have a way to write/merge multiple buckets in one HTTP request. The backend connector's put_multi and get_multi_range aren't used for bucket batches.

What "right" would look like

Batch bucket reads via get_multi_range against a "bucket super-object" — if buckets for many cells share a backing Object (with offset tables), one GET fetches them all. Spec/0007 has the partitioned-bucket concept but it isn't wired up for the typical case.
Or: backend-side multi-PUT. S3 doesn't natively support this but MinIO does via composeObject. The connector could compose multiple bucket payloads server-side.

This is a future optimization; not a correctness issue. But it's a big chunk of the observed ingest slowdown.

Progress summary (2026-05-15)

#	Flaw	Status
1	Auto-rebuild additive-only	✅ Resolved (Phase 1 delete + Phase 2.2 merge)
2	O(N) re-dispatch on every SI change	🟡 Partially resolved (Phase 3.1 + 3.2)
3	Inline auto-rebuild blocks writer	✅ Resolved (Phase 1)
4	Per-batch cost O(k·dim)	🟡 Partially resolved (k bounded by Phase 2.2)
5	Sharded ada-ivf-step half a solution	🔵 Unblocked by Phase 3.1+3.2
6	Paged tracks unsupported in rebuild	🔵 Unblocked by Phase 3.1+3.2
7	ada-ivf-status lies about imbalance	✅ Resolved (Phase 2.1)
8	GC requires manual scripting	✅ Resolved (Phase 2.4)
9	No deletion / tombstones	⏳ Outstanding
10	Append + rebuild conflict no recovery	🟡 Partially resolved (Phase 2.3)
11	Decode-on-rebuild quantization drift	🟡 Partially mitigated by Phase 3.2
12	Per-batch merge-on-write HTTP overhead	⏳ Outstanding (bounded)

4 resolved, 4 partially resolved, 2 unblocked, 2 outstanding. One outstanding item (§12) is bounded enough not to be on the critical path; the other (§9, tombstones) and the full multi-parent merge still open under §10 are spec-level work for Phase 3.5+.

Meta-pattern: layers vs. roots

Reviewing the list, almost every flaw fits one of three patterns:

A. "Maintenance is async work" forced inline. Flaws §1, §3, §8 are all the same shape: we wanted a thing to happen "automatically" (rebuilds, GC, retraining) and the no-daemon stance forced it inline or onto the operator. The result either stalls the writer or never runs.

The root: DreamDB's architectural decision to treat maintenance as external (cron / k8s / GitHub Actions) is the right call for a protocol, but the SDK keeps trying to be "self-healing" anyway. Either embrace the external-scheduler stance and remove the inline auto-rebuild, or build the missing scheduler primitive (e.g. a dreamdb watch command that's a single-process loop polling recommendations and running them).

B. "Strict equality" lineage that should be a chain. Flaws §2, §6, §10 are all the same shape: bucket lineage is checked as == against the current SI, with no concept of "this SI descends from that one". The result is full data rewrites on every centroid change, paged-track maintenance verbs being stubbed out, and no sane multi-writer collaboration.

The root: spec/0007's lineage check needs to be hash-chain aware. Add parents: Vec<Multihash> to SI Objects, change bucket lineage check to "is current SI's hash in any ancestor". This single change unlocks several of the listed flaws.

C. "Cheap" implementations that look right but accumulate lies. Flaws §7, §8, §11 are all the same shape: we shipped the simpler implementation (list-prefix counting, no GC, decode-and-redispatch) which works in isolation but accumulates pathologies as the dataset ages.

The root: every implementation should ask "what does this look like on a dataset that's been alive for 1 year?" The straightforward implementations only consider the day-1 dataset.

Honest current status

Update 2026-05-15 (later): After the same-day fix sweep (Phases 1, 2.1-2.4, 3.1, 3.2), the maintenance/operations layer is structurally healthy. Most of the bullets below have moved from "broken" to "shippable" or "addressed." The list as originally written still shows the design philosophy this work proved out: the protocol layer was correct; the maintenance layer needed scope-rationalization and the lineage-chain primitive. Both shipped today.

Honest current status (as originally written, preserved for the chain of reasoning)

What is solid:

Immutability + content-addressing
Time-travel for unmodified datasets (broken once we GC, but that's the chain-lineage fix)
RaBitQ encoding correctness (bit-identical Rust + JS)
Read-online property during rebuilds (queries always see consistent state)
Spec discipline

What is not ready for production:

Any dataset that needs to survive >1 year of growth (GC + lineage rigidity)
Any dataset that needs deletion (flaw §9)
Any dataset at 100M+ records that needs maintenance (flaws §5, §6)
Any team workflow with >1 writer (flaw §10)
Any deployment with strict latency budgets on append (flaw §3)

The protocol layer is well-grounded. The maintenance/operations layer above it has consistent gaps that all trace back to either the no-daemon stance not being fully embraced, the lineage check not being chain-aware, or the rush to ship the cheap path. The fixes are known; they're each on the order of a focused 1-2 week iteration. What's harder is admitting that some of what we shipped (specifically inline auto-rebuild) is the wrong abstraction and should be removed, not patched.
