Known Flaws — A Design Retrospective
As of 2026-05-15, after the Ada-IVF + auto-rebuild iteration.
Update 2026-05-15 (later that day): most of the architectural flaws listed below have been resolved or have a concrete shipping path. The "Status" line at the top of each flaw tracks the current state:
- ✅ Resolved: shipped, tests green, live data verified.
- 🟡 Partially resolved: foundation shipped, follow-up optimizations queued.
- 🔵 Unblocked: blocker removed, implementation now bounded.
- ⏳ Outstanding: still flawed as described.
Updates are inline below; the original analysis is preserved so the chain of reasoning stays visible.
This document catalogs DreamDB's current limitations by reconstructing the chain of decisions that produced each one. The pattern is consistent enough to name explicitly: we keep solving the local symptom of a problem by adding a layer that introduces its own symptom one level down. Most flaws are not isolated bugs — they are the next layer's surface of a deeper unresolved tension.
Each section follows the same shape:
- Problem — what we were trying to solve.
- What we built — the mechanism we shipped.
- New issue surfaced — the cost or limitation we now have to live with.
- Root cause — the deeper architectural constraint that produced the symptom in the first place.
The closing section maps these to the architectural tensions they share.
1. Auto-rebuild is additive-only and unbounded
Status: ✅ Resolved (2026-05-15). Two fixes:
- Inline auto-rebuild was deleted entirely (Phase 1 of design/0003). Maintenance is now operator-driven via dreamdb-cli.
- ada-ivf-step --merge-threshold added (Phase 2.2). The CLI now merges underpopulated cells alongside splitting hot ones, structurally bounding k growth.
Live evidence: 231K imagenet-100 dataset, k=27,248 → 24,526 in one pass, the first observed rebuild that shrank k.
Problem
Operator-driven dreamdb-cli rebuild-ivf is the principled way to keep
an IVF index healthy as data shifts, but it requires the operator to
notice imbalance, schedule a job, and wait for completion. Small-scale
users wanted "the index just stays healthy" without that overhead.
What we built
Schema.add_embedding(auto_rebuild=True, max_n=10M, threshold=1.5). On
every append_many, after the per-cell merge-on-write loop builds the
new combined bucket entries, we compute per-cell counts, derive the
coefficient-of-variation imbalance score, and if it crosses threshold
AND total_n ≤ max_n we run a localized re-cluster inline via
Dataset::ada_ivf_step_inline. The same Manifest publish covers both
the appended records and the rebuilt SI atomically.
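The gate described above can be sketched as follows. This is a minimal illustration, not the shipped code: the function names are hypothetical, and both the use of the population standard deviation and the strict `>` comparison are assumptions about how "crosses threshold" was implemented.

```rust
/// Coefficient of variation (population std-dev / mean) of per-cell record counts.
/// Returns 0.0 for an empty index or when every cell is empty.
fn imbalance_score(cell_counts: &[u64]) -> f64 {
    if cell_counts.is_empty() {
        return 0.0;
    }
    let n = cell_counts.len() as f64;
    let mean = cell_counts.iter().map(|&c| c as f64).sum::<f64>() / n;
    if mean == 0.0 {
        return 0.0;
    }
    let variance = cell_counts
        .iter()
        .map(|&c| {
            let d = c as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n;
    variance.sqrt() / mean
}

/// Fire only when imbalance crosses the threshold AND the dataset is still
/// small enough (total_n <= max_n) for an inline rebuild.
fn should_rebuild_inline(cell_counts: &[u64], threshold: f64, max_n: u64) -> bool {
    let total_n: u64 = cell_counts.iter().sum();
    imbalance_score(cell_counts) > threshold && total_n <= max_n
}
```

With the default threshold of 1.5, four cells holding 0, 0, 0 and 40 records score √3 ≈ 1.73 and would trigger a rebuild; four cells of 10 each score 0 and would not.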
New issue surfaced
Auto-rebuild only splits cells; it never merges them. Each fire
grows k by ~30% (n_splits ≈ 2–8 per hot cell). On the imagenet-100
recreate this produced a measured climb of:
At each step the per-batch ingest cost grew with k (see flaw §4) and
average throughput fell from 280/s to 52/s over the course of one
ingest. The "self-healing" mechanism hurt the workload it was supposed
to help. It also kept firing in an infinite-extension pattern: density
hits the gate → split → next density gate is higher → wait for more
data → split again. No equilibrium exists because the splits
operate only in one direction.
Root cause
- The dreamdb_protocol::ada_ivf module shipped local_split, update_centroids, and find_overpopulated_partitions but never a merge_partitions primitive. The find_underpopulated_partitions function exists but is unused.
- The Mohoney et al. Ada-IVF paper (arXiv 2411.00970 §4) explicitly prescribes BOTH splits AND merges with a global k-cap. We implemented half the algorithm.
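For illustration, here is roughly what the missing merge primitive could look like. Everything in this sketch is an assumption: the `Cell` type, the function name, squared Euclidean distance as the nearest-neighbour metric, and the count-weighted mean as the merged centroid. A cosine index would presumably renormalise the result, and the absorbed cell's records would still need re-dispatch; neither is shown.

```rust
/// One IVF cell: a centroid plus the number of records assigned to it.
#[derive(Debug, Clone, PartialEq)]
struct Cell {
    centroid: Vec<f32>,
    count: u64,
}

fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Merge cell `idx` into its nearest neighbour, shrinking k by one.
/// Returns the post-merge index of the surviving cell, or None when there is
/// nothing to merge with.
fn merge_into_nearest(cells: &mut Vec<Cell>, idx: usize) -> Option<usize> {
    if cells.len() < 2 || idx >= cells.len() {
        return None;
    }
    let victim = cells.remove(idx);
    // Nearest remaining centroid by squared Euclidean distance.
    let (target, _) = cells
        .iter()
        .enumerate()
        .map(|(i, c)| (i, squared_distance(&c.centroid, &victim.centroid)))
        .min_by(|a, b| a.1.total_cmp(&b.1))?;
    let survivor = &mut cells[target];
    let total = survivor.count + victim.count;
    if total > 0 {
        // Count-weighted mean of the two centroids.
        let w_survivor = survivor.count as f32 / total as f32;
        let w_victim = victim.count as f32 / total as f32;
        for (x, v) in survivor.centroid.iter_mut().zip(&victim.centroid) {
            *x = *x * w_survivor + *v * w_victim;
        }
    }
    survivor.count = total;
    Some(target)
}
```

The key property is that every call reduces k by exactly one, which is what gives a split-and-merge loop a chance of reaching equilibrium.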
What "right" would look like
- Add a merge primitive that combines an underpopulated cell with its nearest neighbour. Requires composing update_centroids to support merge-replacement in addition to drop-and-append.
- Add a hard cap: k_max = 2 · √N. When auto-rebuild would push k past this, it must MERGE before splitting, or refuse to fire and emit an "operator must rebuild-ivf" warning.
- Default auto_rebuild=False and document it as a small-scale convenience, not the production maintenance path.
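The cap and the merge-before-split rule above can be sketched as a small planning function. The names (`k_max`, `StepPlan`, `plan_step`) are hypothetical, rounding the cap up is an assumption, and the sketch assumes each merge frees exactly one slot under the cap.

```rust
/// Hard cap from the text: k_max = 2 * sqrt(N), rounded up, never below 1.
fn k_max(total_n: u64) -> usize {
    ((2.0 * (total_n as f64).sqrt()).ceil() as usize).max(1)
}

#[derive(Debug, PartialEq)]
enum StepPlan {
    /// Enough headroom under the cap: split the hot cells directly.
    Split,
    /// Splitting alone would exceed the cap: merge this many cells first.
    MergeThenSplit { merges_needed: usize },
    /// Too few underpopulated cells to merge: refuse and warn the operator.
    Refuse,
}

fn plan_step(
    k: usize,
    new_cells_from_splits: usize,
    mergeable_cells: usize,
    total_n: u64,
) -> StepPlan {
    let cap = k_max(total_n);
    let projected = k + new_cells_from_splits;
    if projected <= cap {
        return StepPlan::Split;
    }
    let merges_needed = projected - cap;
    if merges_needed <= mergeable_cells {
        StepPlan::MergeThenSplit { merges_needed }
    } else {
        StepPlan::Refuse
    }
}
```

For 131K records the cap works out to 724, so an index already at k=16,733 would be refused outright rather than split further.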
2. Every SI change forces O(N) record re-dispatch
Status: 🟡 Partially resolved (2026-05-15). Chain-aware lineage shipped (Phase 3.1) — SpatialIndexObject carries parents: Vec<Multihash> per spec/0004 §3.5, bucket lineage check walks the chain up to 100 ancestors. Cold-bucket skip in rebuild_all_buckets (Phase 3.2) uses the same id-map as update_centroids to identify preserved cells; those cells skip decode + redispatch entirely. Live: 35% of cells preserved on the first imagenet-100 split-only test. Outstanding: redispatch within shifted-position cells still re-PUTs identical bytes under the new spatial_key path; an address-scheme change (move spatial_key off the path, into the Track entry only) would eliminate the re-PUT — deferred to a future iteration.
Problem
After Ada-IVF or full-rebuild produces a new SI, the SDK has to ensure no record is queried against a centroid set it wasn't placed under — otherwise queries would mis-route and recall would collapse silently.
What we built
Bucket-header lineage check (spec/0007 §6.1.2). Every SpatialBucket
Object's header carries the 33-byte spatial_index_hash of the SI it
was placed under. Dataset::append_many reads each prior bucket's
header and refuses to merge if the hash differs from the current
schema's SI hash. The error suggests "re-ingest from scratch into a
fresh Ref."
New issue surfaced
Any centroid change requires rewriting every bucket in the dataset. Even when only one hot cell needs splitting (touching, say, 0.1% of records), the other 99.9% must be GET+decode+re-bucket+PUT just so their bucket headers carry the new SI hash. We measured this cost on the imagenet-100 recreate: each Ada-IVF step re-dispatched all 131K records (~30 s), and at 10B records it would take hours.
Worse: the re-dispatch goes through vc.decode(...), which for RaBitQ
produces an approximate f32 reconstruction. Every rebuild
compounds quantization error — records rebucketed many times drift
further from their "true" centroid.
Root cause
Lineage was modelled as a strict equality check, not as a chain. The bucket header carries one hash and there's no notion of "this SI descends from that older SI". So an updated SI is always treated as incompatible with all prior buckets.
What "right" would look like
- Bucket header carries spatial_index_lineage: Vec<Multihash> (the chain of SI ancestors). Lineage check passes if the current SI's hash is in any ancestor's parents chain.
- SI Object gains a parents: Vec<Multihash> field: the SIs it evolves from.
- For cells whose centroid was UNCHANGED across the SI update, the bucket needs no rewrite. For cells whose centroid was REPLACED, records must still be re-dispatched (correctness), but only those records, not the whole dataset.
This is a spec-level change. It probably also requires the SI to record which centroid indices changed between ancestor and self, so SDKs can mechanically compute "is bucket X still valid".
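A chain-aware check along these lines might look like the sketch below. It follows the variant described in the Status note (the SI Object carries parents, and the check walks the current SI's ancestors with a bound). The string hashes, the `parents_of` lookup table, and the function name are stand-ins for illustration only.

```rust
use std::collections::HashMap;

/// Stand-in for a 33-byte Multihash.
type SiHash = &'static str;

/// Returns true when the SI hash stored in a bucket header is the current SI
/// or one of its ancestors. `parents_of` is a hypothetical lookup from an SI
/// hash to that SI's `parents` field. The walk examines the current SI plus
/// at most `max_ancestors` ancestors.
fn bucket_is_compatible(
    bucket_si: SiHash,
    current_si: SiHash,
    parents_of: &HashMap<SiHash, Vec<SiHash>>,
    max_ancestors: usize,
) -> bool {
    let mut frontier = vec![current_si];
    let mut examined = 0usize;
    while let Some(si) = frontier.pop() {
        if si == bucket_si {
            return true;
        }
        examined += 1;
        if examined > max_ancestors {
            break;
        }
        if let Some(parents) = parents_of.get(si) {
            frontier.extend(parents.iter().copied());
        }
    }
    false
}
```

Strict equality is the special case `max_ancestors = 0`: only a bucket placed under the current SI itself passes.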
3. Inline auto-rebuild blocks the writer
Status: ✅ Resolved (2026-05-15). Deleted entirely (Phase 1). The "inline" framing was wrong — maintenance is now async via operator-scheduled CLI. The throughput collapse from 280/s → 52/s observed during the recreate run was caused by this; without inline rebuilds, ingest throughput returns to peak rates (limited only by per-batch IvfCosine.hash_vector cost at high k, which is a separate concern bounded by Phase 3.2).
Problem
We wanted the append path to fix imbalance on its own. Async background maintenance would require a daemon, which conflicts with DreamDB's no-daemon design (cron / k8s CronJob / GitHub Actions instead).
What we built
Dataset::ada_ivf_step_inline runs inside append_many between the
bucket-consolidation loop and the Manifest publish. If imbalance crosses
threshold AND N ≤ max_n, it fires synchronously: GET every bucket,
decode, redispatch, PUT new buckets, publish new SI, then continue with
the normal Manifest publish.
New issue surfaced
The writer stalls for the full duration of the rebuild. At 131K
records this was ~30 seconds; at 1M records it'd be ~5 minutes; at
10M (our self-imposed max_n) it'd be ~50 minutes. The max_n knob
was supposed to bound this — but a 50-minute "automatic" stall is
not what users expect from a streaming append API. Above max_n we
fall back to an eprintln! warning, which most callers won't see.
Root cause
The fundamental tension: DreamDB's no-daemon stance + immutability +
content-addressing means "background work" must come from external
schedulers. There's no in-protocol way to defer work without somebody
running the deferred job. Inline was the only way to keep auto_rebuild
self-contained — but inline means synchronous.
What "right" would look like
- Two changes:
  - The "imbalance check at append time" is fine (cheap). It should emit a Manifest-registry signal (dreamdb.recommendations) that external monitoring picks up and triggers ada-ivf-step async.
  - The auto-rebuild firing should be ripped out. Replace it with "tell the operator to schedule a rebuild", even at small scale.
- Or accept that some users will want "automatic" and provide a separate dreamdb-cli watch command that polls Manifests and runs ada-ivf-step when recommendations land. That's a daemon, but it's an OPT-IN one external to the protocol.
4. Per-batch cost scales O(k·dim)
Status: 🟡 Partially resolved (2026-05-15). The merge step from flaw §1's fix now bounds k growth, addressing the root cause. The intrinsic O(k·dim) cost remains (it's how IVF works), but k now stays near √N for healthy workloads. Future: parallelize hash_vector via rayon (~30 LOC, would 5-8× the dispatch throughput on multi-core). IMI partitioning (already in protocol, not used in production datasets) would also help at extreme k.
Problem
IVF dispatch needs to compute, for each new record, which of the k centroids it's closest to. This is the natural mechanic.
What we built
IvfCosine::hash_vector does k dot products of dim-d vectors per
record. Single-threaded f32 left-fold per spec/0004 §5.4.
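As a rough sketch of that dispatch (not the spec/0004 §5.4 code itself), assuming centroids and the input vector are already L2-normalised so that cosine similarity reduces to a dot product. The tie-breaking rule (lowest id wins) is also an assumption.

```rust
/// Assign `v` to the centroid with the highest dot product.
/// Cost is k dot products of length dim, i.e. O(k * dim) per record.
/// Returns None when there are no centroids.
fn hash_vector(centroids: &[Vec<f32>], v: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (id, c) in centroids.iter().enumerate() {
        // Sequential left fold, so the f32 summation order is fixed.
        let dot = c.iter().zip(v).fold(0.0f32, |acc, (a, b)| acc + a * b);
        // Strict `>` keeps the lowest id on ties.
        if best.map_or(true, |(_, score)| dot > score) {
            best = Some((id, dot));
        }
    }
    best.map(|(id, _)| id)
}
```

The loop body runs once per centroid, which is why everything downstream in this section scales with k.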
New issue surfaced
At inflated k (say 16,733 at dim 512), per-record hash cost is ~3 ms. Per 256-sample batch that's ~770 ms — just for dispatch. Combined with merge-on-write HTTP (~2 s/batch at k=16K) and CLIP encode (~100 ms), the per-batch cost climbs to ~3 s and throughput drops to ~85 samples/s. We observed this during the imagenet-100 recreate.
Root cause
Linear-in-k cost is intrinsic to flat IVF. The reason it hurts is flaw §1: auto-rebuild inflated k far past √N. With a properly-sized k (≈ 363 for 131K records), the cost would be ~25× cheaper and the ingest would run at 500-2000 samples/s.
What "right" would look like
- Fix flaw §1 (cap k growth).
- Optionally adopt IMI (Inverted Multi-Index) for the partitioning, which factorizes the k-way centroid lookup into 2 × √k lookups: one √k-entry codebook per half of the vector. Spec already defines dreamdb.imi-cosine for this purpose; we're not using it in production datasets.
- Multi-thread the dispatch via rayon: each batch of 256 records can be hashed in parallel, dropping the ~770 ms to ~100 ms on an 8-core machine.
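The parallel-dispatch idea from the last bullet can be sketched with the standard library alone. Note the swap: this uses std::thread::scope with one contiguous chunk per worker instead of rayon, purely to keep the example dependency-free. The function names are hypothetical and `nearest` assumes a non-empty centroid list.

```rust
use std::thread;

/// Nearest centroid by dot product, using a sequential left fold.
/// Assumes `centroids` is non-empty.
fn nearest(centroids: &[Vec<f32>], v: &[f32]) -> usize {
    let mut best = (0usize, f32::NEG_INFINITY);
    for (id, c) in centroids.iter().enumerate() {
        let dot = c.iter().zip(v).fold(0.0f32, |acc, (a, b)| acc + a * b);
        if dot > best.1 {
            best = (id, dot);
        }
    }
    best.0
}

/// Hash a batch in parallel by splitting it into contiguous chunks, one per
/// worker. Output order matches input order because the chunks are joined in
/// the order they were spawned.
fn hash_batch_parallel(
    centroids: &[Vec<f32>],
    batch: &[Vec<f32>],
    workers: usize,
) -> Vec<usize> {
    if batch.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    let chunk_len = (batch.len() + workers - 1) / workers;
    thread::scope(|s| {
        // Spawn every worker before joining any, so chunks run concurrently.
        let handles: Vec<_> = batch
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk.iter().map(|v| nearest(centroids, v)).collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("dispatch worker panicked"))
            .collect()
    })
}
```

Each record's hash is still computed by a single-threaded fold, so the per-record result is unchanged; only the records are spread across threads.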
5. Sharded ada-ivf-step is half a solution
Status: 🔵 Unblocked (2026-05-15). Chain-aware lineage + cold-bucket skip mean the redispatch step is now O(touched cells), not O(N). The orchestrator's record-redispatch in rebuild_all_buckets was also parallelized via buffer_unordered(16) for the fetch step. Implementing true sharded redispatch (workers handle their slice's replaced cells in parallel, orchestrator stitches paged Track leaves) is now bounded code, not architectural redesign. Deferred to Phase 3.3.
Problem
Single-machine ada-ivf-step on 10B records would take ~3 hours. We
wanted a path that scales horizontally across k8s pods.
What we built
Two-stage sharded mode:
- Workers (--shard N --of M --job-id X): each worker claims hot cells where cell_id % M == N, decodes their records, runs local_split, writes shard JSON at <bucket>/_ada_ivf/<job-id>/centroids/shard-NNNN.json.
- Orchestrator (--orchestrate --job-id X): reads all shard JSONs, aggregates centroid replacements, publishes new SI, then re-dispatches EVERY record (single-machine), publishes Manifest, CAS the Ref.
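The worker-side claiming rule and output path can be sketched like this. The helper names are hypothetical, and zero-padding NNNN to four digits is an assumption read off the path template.

```rust
/// A worker started with `--shard N --of M` owns the hot cells where
/// cell_id % M == N.
fn claimed_cells(hot_cells: &[u64], shard: u64, of: u64) -> Vec<u64> {
    assert!(of > 0 && shard < of, "shard must be in 0..of");
    hot_cells.iter().copied().filter(|id| id % of == shard).collect()
}

/// Where a worker writes its shard JSON.
/// Zero-padding the shard number to four digits is an assumption.
fn shard_path(bucket: &str, job_id: &str, shard: u64) -> String {
    format!("{bucket}/_ada_ivf/{job_id}/centroids/shard-{shard:04}.json")
}
```

Because the rule is a pure function of cell_id, workers need no coordination to agree on ownership: every hot cell lands in exactly one shard.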
New issue surfaced
Only the centroid-computation step is parallelized. The expensive step — record re-dispatch — runs serially on the orchestrator. At 10B records, decode + hash_vector + bucket re-PUT is the bottleneck regardless of how many workers we have. The sharded mode's wall-clock improvement is ~30% (the local_split portion), not 100× as the parallel-workers naming implies.
A second stage of sharding could distribute the redispatch (each worker handles records in cells it owns), but that adds a second synchronization barrier (workers need to know new SI before dispatching → orchestrator publishes SI → second worker pass dispatches → second orchestrator pass finalizes Manifest). Four sequential k8s Jobs.
Root cause
The orchestrator HAS to single-thread the re-dispatch because the
final Track is one CBOR Object. Parallel workers can each produce
sub-buckets, but assembling them into a single Track is serial. To
truly distribute, the Track would need to be paged (B-tree of leaf
pages), and our ada-ivf-step doesn't yet support paged tracks
(flaw §6).
What "right" would look like
- Phase-3 sharded redispatch: each worker handles records in cells it owns, after orchestrator publishes new SI. Each worker emits a paged-track LEAF Object. Final orchestrator stage assembles leaves into a B-tree.
- Paged-Track support in update_centroids and bucket re-dispatch paths. Currently rejected outright (flaw §6).
6. Paged tracks aren't supported by rebuild verbs
Status: ✅ Resolved (2026-05-15, Phase 3.4). ada-ivf-step now READS paged TrackObjects via B-tree walk and WRITES paged TrackObjects via bottom-up B-tree build (leaf=1000 entries, fanout=100). Inline-vs-paged decision auto-fires at 8000-entry threshold (~960 KB). Combined with chain-aware lineage + cold-bucket skip, rebuilds at 1B-cell scale are now O(touched cells) for the bucket pass plus O(N) for the Track B-tree rebuild — the latter is the natural Phase 3.4b target (incremental B-tree update for cells whose entries didn't change).
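To make the shape of that bottom-up build concrete, the sketch below computes how many pages each level would hold for a given entry count, using the parameters quoted above (leaf=1000, fanout=100, 8000-entry threshold). The function name is hypothetical, and treating the threshold as inclusive is an assumption.

```rust
const LEAF_ENTRIES: usize = 1000;
const FANOUT: usize = 100;
const INLINE_THRESHOLD: usize = 8000;

/// Pages per level of a bottom-up build, leaves first and root last.
/// An empty Vec means the Track stays inline.
fn btree_levels(n_entries: usize) -> Vec<usize> {
    if n_entries < INLINE_THRESHOLD {
        return Vec::new();
    }
    // Ceiling division: the last leaf may be partially filled.
    let mut pages = (n_entries + LEAF_ENTRIES - 1) / LEAF_ENTRIES;
    let mut levels = vec![pages];
    // Keep stacking index levels until a single root page remains.
    while pages > 1 {
        pages = (pages + FANOUT - 1) / FANOUT;
        levels.push(pages);
    }
    levels
}
```

A Track with one million entries would need 1000 leaves, 10 index pages, and a root. Every one of those pages is rewritten on a full rebuild, which is the O(N) term the Status note leaves for Phase 3.4b.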
Problem
At ~10K inline track entries the Manifest's inline-array form crosses
1 MiB and per spec/0002 §7.2.2 the track switches to a paged B-tree.
This is necessary at 1B-record scale.
What we built
Dataset::append_many handles paged tracks (B-tree maintenance is in
dreamdb-protocol). But dreamdb-cli ada-ivf-step and the inline
auto-rebuild path BOTH bail out with "paged tracks not yet supported"
when they encounter one.
New issue surfaced
Maintenance verbs become unavailable at exactly the scale where you
most need them. A 1B-record dataset has tens of thousands of
populated cells → its track is paged → ada-ivf-step refuses to run
→ the only path is rebuild-ivf from scratch.
Root cause
Paged-track read/write requires walking a B-tree of Index Pages, splitting/merging on insert, etc. Substantial code that the rebuild verbs need but didn't get because the simpler inline path was enough for our test scale.
What "right" would look like
Implement paged-track support in ada_ivf_step.rs and
ada_ivf_step_inline. Adds maybe ~200 LOC. Without it, the entire
maintenance story is "works at 1M records, broken at 1B" — the
inverse of what DreamDB claims.
7. ada-ivf-status lies about imbalance
Status: ✅ Resolved (2026-05-15, Phase 2.1). ada-ivf-status now reads the current Manifest's Track instead of LIST-PREFIX. Output adds underpopulated_partitions so operators can size the merge step. Verified on imagenet-100: pre-fix reported 100K buckets across 42K populated partitions (counting historical orphans); post-fix reports the accurate 24K buckets across 21K populated partitions.
Problem
Operators want a cheap way to check "should I run ada-ivf-step?"
without fetching the current Manifest + Track.
What we built
dreamdb-cli ada-ivf-status walks the LIST-PREFIX of the modality's
spatial-key space (<timeline>/<modality>/), counts every bucket
Object it sees, computes per-cell counts and imbalance.
New issue surfaced
List-prefix sees every historical bucket, not just live ones. After
the imagenet-100 recreate the current Track had 16,702 live buckets
but ada-ivf-status reported 100,347 buckets across 42,575 partitions.
The imbalance score it produced (1.21) was computed over a fictional
distribution that mixed live + dead buckets.
After the manual GC pass (87K orphans deleted) the status was correct,
but only because GC happened to run. Without GC, ada-ivf-status is
permanently wrong on any dataset that has ever been rebuilt.
Root cause
We chose the "cheap" implementation (list-prefix, no Manifest fetch) over the correct one (resolve Ref → Manifest → Track → walk entries). The cheap path looks right on a fresh dataset but accumulates lies over the dataset's lifetime.
What "right" would look like
Replace the list-prefix scan with a Track walk. Costs one extra HTTP
GET (the Manifest) plus one per Track Object. Trivial. Should have
been the original implementation. The current ada_ivf_step code in
single-machine mode already walks the Track — that's the right code,
just split out into its own verb.
8. GC requires manual scripting
Status: ✅ Resolved (2026-05-15, Phase 2.4). dreamdb-cli gc is a first-class verb with --keep-manifests N, --keep-since DURATION, --dry-run. Parallelizes HEAD requests via buffer_unordered(32) and DELETE requests too. Verified: deleted 22,501 orphans from imagenet-100, dataset still resolves correctly. Outstanding nuance: walks parents[0] only, so multi-parent merge histories aren't fully preserved (out of scope until Phase 3.5 ships multi-parent merge).
Problem
DreamDB is immutable and append-only. Every rebuild, every append, every schema migration produces NEW Objects without removing OLD ones. Over time the bucket fills with orphans.
What we built
A spec definition (spec/0006 §7.3) of mark-and-sweep GC with a 24h
Last-Modified threshold — and nothing else. The sample script we wrote
this session (/tmp/dreamdb_gc.py, 175 lines of Python with boto3) is
the entire implementation.
New issue surfaced
Operators have no path to bounded storage. Without GC, every rebuild leaves the previous generation of bucket Objects behind. On a 1-year-old dataset with daily rebuilds you'd have roughly 365× more bucket Objects on disk than are reachable from the current Manifest. List operations slow down linearly in the number of historical Objects.
The manual script we wrote walks only the CURRENT Manifest — meaning running it destroys all time-travel snapshots. So the operator has to choose between "infinite storage growth" and "no time-travel". That's not a real choice.
Root cause
GC is on the spec roadmap (spec/0006 §7.3) but isn't implemented and
there's no dreamdb gc CLI verb. The spec defines the discipline but
not the retention policy: "what manifests should be preserved" is
left to the operator.
What "right" would look like
- dreamdb gc --keep-manifests=N: keep the N most recent Manifests reachable from each Ref, GC everything else. Default N=100 so daily rebuilds get 100 days of time-travel.
- dreamdb gc --keep-since=24h: preserve everything modified within 24 hours (the spec's safety threshold).
- Combined: respect both filters; never delete an Object that's still reachable from any preserved Manifest.
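How the two retention filters combine can be sketched as a plain mark-and-sweep over in-memory data. The types and the function name are hypothetical, u64 values stand in for content hashes, and a single Ref's history is assumed.

```rust
use std::collections::HashSet;

/// One Manifest in a Ref's history, with the Objects reachable from it.
struct Manifest {
    reachable: Vec<u64>,
}

/// A stored Object and its age in seconds (derived from Last-Modified).
struct StoredObject {
    hash: u64,
    age_secs: u64,
}

/// An Object is deletable only if it is unreachable from the `keep_manifests`
/// newest Manifests AND older than `keep_since_secs`.
fn deletable(
    history_newest_first: &[Manifest],
    objects: &[StoredObject],
    keep_manifests: usize,
    keep_since_secs: u64,
) -> Vec<u64> {
    // Mark: everything reachable from a preserved Manifest.
    let marked: HashSet<u64> = history_newest_first
        .iter()
        .take(keep_manifests)
        .flat_map(|m| m.reachable.iter().copied())
        .collect();
    // Sweep: unmarked Objects that are past the safety window.
    objects
        .iter()
        .filter(|o| !marked.contains(&o.hash) && o.age_secs > keep_since_secs)
        .map(|o| o.hash)
        .collect()
}
```

Setting `keep_manifests` to 1 reproduces the behaviour of the manual script (only the current Manifest survives); anything larger preserves that many time-travel snapshots.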
9. No deletion / tombstones
Status: ✅ Resolved (2026-05-18, B8 in 10B-scale push). spec/0020 defines TombstoneListObject (anchor-keyed, parent-DAG, canonical CBOR). Dataset::delete(&[u64], reason) + Dataset::tombstone_set() + dreamdb delete CLI ship the operator surface. Read paths (iter_with_fields, iter_stream) auto-consult the tombstone set; deleted anchors disappear from queries without rewriting the underlying Track. Storage compaction (reclaim bytes for tombstoned records) is deferred per spec/0020 §6. 9 new tests, 721 total green.
Problem
DreamDB is append-only by design — append, never mutate.
What we built
... nothing. The protocol has no deletion verb. The spec doesn't address it.
New issue surfaced
Records that were ingested wrong (corrupted CLIP embedding, bad parquet row, GDPR-mandated deletion) cannot be removed. The only workaround is to read the current Track, build a new Track that excludes the unwanted records, and publish a new Manifest pointing at it. This is a full data rewrite for one deletion.
Root cause
Append-only is the protocol's defining design choice. It's what unlocks content-addressing, time-travel, and lock-free reads. But it has no story for the "this row shouldn't exist" use case that real production systems hit weekly.
What "right" would look like
- dreamdb.tombstones registry entry: per-modality, a list of (track_position, anchor_hash) pairs marking deleted records. Query path skips records whose ordinal+hash matches.
- Eventually-compacting: on the next rebuild, the rebuilt Track omits tombstoned records entirely.
- GDPR-compliant: the original record's bytes are NOT deleted from the content store (it's content-addressed, possibly shared) but the path from any Ref to those bytes is severed.
This needs spec work.
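The read-path behaviour can be sketched as a lazy filter. This follows the anchor-keyed variant mentioned in the Status line rather than the (track_position, anchor_hash) pairs proposed above; the `Record` type and the function name are hypothetical.

```rust
use std::collections::HashSet;

/// A record as the read path sees it: its anchor plus an opaque payload.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    anchor: u64,
    payload: &'static str,
}

/// Yield only records whose anchor is not tombstoned. The Track itself is
/// untouched; the filter is applied as records stream past.
fn iter_live<'a>(
    track: &'a [Record],
    tombstones: &'a HashSet<u64>,
) -> impl Iterator<Item = &'a Record> + 'a {
    track.iter().filter(move |r| !tombstones.contains(&r.anchor))
}
```

Deletion then costs one small tombstone write instead of a Track rewrite, at the price of one set lookup per record on every read until compaction.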
10. Append + rebuild conflict has no recovery path
Status: ✅ Resolved (2026-05-18, B2 in 10B-scale push). Dataset::branch(name), Dataset::merge(other, MergeStrategy::FastForward), and MergeStrategy::UnionTracks (3-way fused-merge with LCA walk + per-cell bucket reconciliation) all ship. Dataset::merge_many(&[branches]) + dreamdb merge-many CLI orchestrate N-way sharded ingest. The protocol-level framing (layered-merge vs fused-merge) is now in spec/0008 §5.3; the algorithm lives in design/0007-sharded-ingest.md. Outstanding: Fragment/Scalar/Constant union-merge (currently the algorithm refuses non-embedding diverged tracks — v0.1 extension).
Problem
Concurrent append and rebuild on the same Ref would race the SI swap: the appender places records under the old SI while the rebuilder publishes a new SI. If the loser of that race could still land its Ref CAS, records would be silently mis-routed.
What we built
SpatialIndex conflict = MUST-REFUSE merge (per project_collab_disciplines.md).
Both writers do CAS; one wins; the other gets CasFailed and bails
out. Documentation says "use a feature branch".
New issue surfaced
Dataset.branch() is not implemented. The advice to "use a feature
branch" has no API. Operators who hit the conflict have nothing they
can do programmatically — they have to manually create a new Ref via
the connector and figure out coordination themselves.
Root cause
Refs are 33-byte content pointers under <bucket>/refs/<name>. Branching
should just be "create a new Ref pointing at the current Manifest":
one PUT. The mechanics are trivial; the API surface and merge story
aren't there.
What "right" would look like
- Dataset::branch(new_ref_name) -> Result<Dataset>: PUT a new Ref at current Manifest hash; return a Dataset bound to it.
- Dataset::merge(other_ref, strategy: MergeStrategy) -> Result<()>: build a new Manifest with parents=[self.tip, other.tip]. Strategy options: refuse on SI conflict, fast-forward only, etc.
- These belong in dreamdb-dataset/src/dataset.rs next to create/open.
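To show why branching sidesteps the conflict, here is a toy in-memory Ref store with the two operations involved. `RefStore`, `RefError`, and the u64 Manifest hashes are illustrative stand-ins, not the dreamdb-dataset API.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum RefError {
    AlreadyExists,
    NotFound,
    CasFailed,
}

/// In-memory stand-in for `<bucket>/refs/<name>`: ref name -> Manifest hash.
#[derive(Default)]
struct RefStore {
    refs: HashMap<String, u64>,
}

impl RefStore {
    /// Branching is one write: a new Ref pointing at the source Ref's tip.
    fn branch(&mut self, from: &str, new_name: &str) -> Result<u64, RefError> {
        let tip = *self.refs.get(from).ok_or(RefError::NotFound)?;
        if self.refs.contains_key(new_name) {
            return Err(RefError::AlreadyExists);
        }
        self.refs.insert(new_name.to_string(), tip);
        Ok(tip)
    }

    /// Compare-and-swap: advance `name` only if it still points at `expected`.
    fn cas(&mut self, name: &str, expected: u64, new_tip: u64) -> Result<(), RefError> {
        let slot = self.refs.get_mut(name).ok_or(RefError::NotFound)?;
        if *slot != expected {
            return Err(RefError::CasFailed);
        }
        *slot = new_tip;
        Ok(())
    }
}
```

With the rebuilder working on its own branch, the appender's CAS on the main Ref and the rebuilder's CAS on the branch no longer compete; reconciling the two tips becomes the job of a later merge.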
11. Decode-on-rebuild compounds quantization error
Status: 🟡 Partially mitigated (2026-05-15). Cold-bucket skip (Phase 3.2) means cells whose centroid is preserved never get re-decoded — those records sit on disk indefinitely with the same compressed codes. Only the replaced cells' records pass through vc.decode → hash_vector → re-encode. So the compounding only hits records in cells that actually changed, not every record on every rebuild. Materially better but not perfect. Real fix is rerank=True schemas (raw f32 stored alongside codes) — already shipped, just not used on the imagenet-100 demo.
Problem
The rebuild verbs (ada-ivf-step and the inline auto-rebuild) need to
re-dispatch records — meaning they need each record's f32 vector to
compute its new centroid id.
What we built
For each record we call vc.decode(codes) to recover an approximate
f32 vector via inverse rotation of the RaBitQ codes. That's then fed
into IvfCosine::hash_vector(...) to get the new spatial key.
New issue surfaced
1-bit RaBitQ decode is approximate (per-dim is ±scale, not the original f32 value). Records that are rebucketed many times drift further from where they "should" be — each rebuild's decoded f32 is already lossy, so the new spatial_key it's assigned is computed from a worse approximation than the previous one.
Empirically OK at our scale (recall stayed >95% after 10+ rebuilds) but a real failure mode at long-lived, heavily-rebuilt datasets where the same records sit at the same cell across many rebuild cycles.
Root cause
We store ONLY the compressed codes; the original f32 is gone after encoding. Recovering f32 from 1-bit codes is fundamentally lossy.
What "right" would look like
- The two-pass-rerank schema (rerank=True) keeps raw f32 alongside compressed codes in a parallel VectorStorage. Rebuilds on rerank-on datasets can use the RAW vectors for re-dispatch: exact, not approximate.
- Without rerank, the only fix is to NOT decode + redispatch, i.e. abandon Ada-IVF for raw datasets and only support rebuild-ivf (a full retrain that uses fresh data, not decoded data).
12. Hidden cost: merge-on-write HTTP round-trips per batch
Status: ⏳ Outstanding (but bounded). Still ~N HTTP GET + N HTTP PUT per batch for the N cells touched. With k now bounded near √N (flaw §1 fix), the average batch touches fewer cells than before; the throughput collapse from k inflation is gone. Future: backend-side compose (MinIO composeObject) or partitioned-bucket super-objects (spec/0007 has the concept but no implementation). Not on the critical path.
Problem
Earlier versions of DreamDB emitted one bucket per ingest batch per
spatial key, producing many small buckets per cell (many_batches_consolidate_to_one_bucket_per_cell test caught this). We fixed it.
What we built
Per project_bucket_consolidation.md: each append_many reads every
prior bucket for each cell it's about to write, merges with new
records, and writes ONE consolidated bucket per cell.
New issue surfaced
Every batch pays N HTTP GET + N HTTP PUT for the N cells it touches. At dense ingest into a high-k dataset, a typical 256-sample batch spreads across ~200 cells. That's ~200 GETs + ~200 PUTs to MinIO, ~10 ms each, = ~4 seconds of HTTP for one batch. Throughput-bound at ~64 samples/s.
We measured this during the recreate at k=16,733. The HTTP overhead
matches IvfCosine::hash_vector cost almost exactly — both scale
with k.
Root cause
Each bucket is one Object. We don't have a way to write/merge multiple
buckets in one HTTP request. The backend connector's
put_multi and get_multi_range aren't used for bucket batches.
What "right" would look like
- Batch bucket reads via get_multi_range against a "bucket super-object": if buckets for many cells share a backing Object (with offset tables), one GET fetches them all. Spec/0007 has the partitioned-bucket concept but it isn't wired up for the typical case.
- Or: backend-side multi-PUT. S3 has no single call that writes several Objects at once, but MinIO's client exposes composeObject, which concatenates existing Objects server-side. The connector could compose multiple bucket payloads that way.
This is a future optimization; not a correctness issue. But it's a big chunk of the observed ingest slowdown.
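The super-object idea can be sketched as a blob plus an offset table. The layout and function names here are hypothetical; spec/0007's actual partitioned-bucket format is not reproduced.

```rust
use std::ops::Range;

/// Pack per-cell bucket payloads into one backing blob plus an offset table
/// of (cell_id, byte range inside the blob).
fn pack_super_object(buckets: &[(u32, Vec<u8>)]) -> (Vec<u8>, Vec<(u32, Range<usize>)>) {
    let mut blob = Vec::new();
    let mut table = Vec::with_capacity(buckets.len());
    for (cell, payload) in buckets {
        let start = blob.len();
        blob.extend_from_slice(payload);
        table.push((*cell, start..blob.len()));
    }
    (blob, table)
}

/// Recover one cell's bucket from the blob. A real connector would turn the
/// ranges for all touched cells into a single multi-range GET instead.
fn read_bucket<'a>(
    blob: &'a [u8],
    table: &[(u32, Range<usize>)],
    cell: u32,
) -> Option<&'a [u8]> {
    table
        .iter()
        .find(|(c, _)| *c == cell)
        .and_then(|(_, range)| blob.get(range.clone()))
}
```

A batch touching ~200 cells would then pay for one offset-table lookup and one ranged read per super-object rather than ~200 separate GETs, assuming the touched cells are co-located.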
Progress summary (2026-05-15)
| # | Flaw | Status |
|---|---|---|
| 1 | Auto-rebuild additive-only | ✅ Resolved (Phase 1 delete + Phase 2.2 merge) |
| 2 | O(N) re-dispatch on every SI change | 🟡 Partially resolved (Phase 3.1 + 3.2) |
| 3 | Inline auto-rebuild blocks writer | ✅ Resolved (Phase 1) |
| 4 | Per-batch cost O(k·dim) | 🟡 Partially resolved (k bounded by Phase 2.2) |
| 5 | Sharded ada-ivf-step half a solution | 🔵 Unblocked by Phase 3.1+3.2 |
| 6 | Paged tracks unsupported in rebuild | ✅ Resolved (Phase 3.4) |
| 7 | ada-ivf-status lies about imbalance | ✅ Resolved (Phase 2.1) |
| 8 | GC requires manual scripting | ✅ Resolved (Phase 2.4) |
| 9 | No deletion / tombstones | ⏳ Outstanding |
| 10 | Append + rebuild conflict no recovery | 🟡 Partially resolved (Phase 2.3) |
| 11 | Decode-on-rebuild quantization drift | 🟡 Partially mitigated by Phase 3.2 |
| 12 | Per-batch merge-on-write HTTP overhead | ⏳ Outstanding (bounded) |
5 resolved, 4 partially resolved, 1 unblocked, 2 outstanding as of 2026-05-15. Of the outstanding items, flaw §12 is bounded enough not to be on the critical path; tombstones (flaw §9), together with the full multi-parent merge still missing from flaw §10, were spec-level work for Phase 3.5+. Flaws §9 and §10 were resolved afterwards, on 2026-05-18; see their Status lines.
Meta-pattern: layers vs. roots
Reviewing the list, almost every flaw fits one of three patterns:
A. "Maintenance is async work" forced inline. Flaws §1, §3, §8 are all the same shape: we wanted a thing to happen "automatically" (rebuilds, GC, retraining) and the no-daemon stance forced it inline or onto the operator. The result either stalls the writer or never runs.
The root: DreamDB's architectural decision to treat maintenance as
external (cron / k8s / GitHub Actions) is the right call for a
protocol, but the SDK keeps trying to be "self-healing" anyway.
Either embrace the external-scheduler stance and remove the inline
auto-rebuild, or build the missing scheduler primitive (e.g. a
dreamdb watch command that's a single-process loop polling
recommendations and running them).
B. "Strict equality" lineage that should be a chain.
Flaws §2, §6, §10 are all the same shape: bucket lineage is checked
as == against the current SI, with no concept of "this SI descends
from that one". The result is full data rewrites on every centroid
change, paged-track maintenance verbs being stubbed out, and no
sane multi-writer collaboration.
The root: spec/0007's lineage check needs to be hash-chain aware.
Add parents: Vec<Multihash> to SI Objects, change bucket lineage
check to "is current SI's hash in any ancestor". This single change
unlocks several of the listed flaws.
C. "Cheap" implementations that look right but accumulate lies. Flaws §7, §8, §11 are all the same shape: we shipped the simpler implementation (list-prefix counting, no GC, decode-and-redispatch) which works in isolation but accumulates pathologies as the dataset ages.
The root: every implementation should ask "what does this look like on a dataset that's been alive for 1 year?" The straightforward implementations only consider the day-1 dataset.
Honest current status
Update 2026-05-15 (later): After the same-day fix sweep (Phases 1, 2.1-2.4, 3.1, 3.2), the maintenance/operations layer is structurally healthy. Most of the bullets below have moved from "broken" to "shippable" or "addressed." The list as originally written still shows the design philosophy this work proved out: the protocol layer was correct; the maintenance layer needed scope-rationalization and the lineage-chain primitive. Both shipped today.
Honest current status (as originally written, preserved for the chain of reasoning)
What is solid:
- Immutability + content-addressing
- Time-travel for unmodified datasets (broken once we GC, but that's the chain-lineage fix)
- RaBitQ encoding correctness (bit-identical Rust + JS)
- Read-online property during rebuilds (queries always see consistent state)
- Spec discipline
What is not ready for production:
- Any dataset that needs to survive >1 year of growth (GC + lineage rigidity)
- Any dataset that needs deletion (flaw §9)
- Any dataset at 100M+ records that needs maintenance (flaws §5, §6)
- Any team workflow with >1 writer (flaw §10)
- Any deployment with strict latency budgets on append (flaw §3)
The protocol layer is well-grounded. The maintenance/operations layer above it has consistent gaps that all trace back to either the no-daemon stance not being fully embraced, the lineage check not being chain-aware, or the rush to ship the cheap path. The fixes are known; they're each on the order of a focused 1-2 week iteration. What's harder is admitting that some of what we shipped (specifically inline auto-rebuild) is the wrong abstraction and should be removed, not patched.