DreamDB Scope Boundaries — what's protocol, what's app
2026-05-15. Companion to design/0002-known-flaws-retrospective.md.
Status as of 2026-05-18: the four-phase migration described below is largely complete. Phase 1 (delete inline auto_rebuild and friends) shipped 2026-05-15. Phase 2 (operator mechanisms: ada-ivf-step --merge, Dataset::branch + merge, dreamdb-cli gc) shipped between 2026-05-15 and 2026-05-18 (B2 + B3 + B4 in the 10B push). Phase 3 (spec changes) is partially done: spec/0008 §5.3 documents multi-parent merge framings; full promotion of design/0007-sharded-ingest.md to a numbered spec is the remaining piece. Phase 4 (operator examples, i.e. k8s YAMLs) is partial; an ada-ivf-step example exists, sharded-ingest YAML is pending.
The flaws retrospective identified that almost every DreamDB problem traces back to one of three architectural tensions, all of which boil down to the same root: we keep pulling operator/app concerns into the protocol layer. This document draws the boundary explicitly, with the lens "DreamDB is to vector databases what git is to version control — a content-addressed plumbing layer, not a user-facing application."
The goal isn't to make DreamDB smaller. It's to make the layers above
DreamDB possible. Right now dreamdb-dataset carries policy
(auto_rebuild=True, threshold defaults, density gates) that should
live in the app calling it. That coupling makes both layers worse: the
SDK is full of half-built scheduling, and apps can't build their OWN
scheduling without fighting the SDK's.
The four layers
Where things have been mis-placed:
- Schema.auto_rebuild=True lives in Layer 2 but it's a Layer 3 policy decision (when to trigger maintenance). It should be deleted from Layer 2 entirely.
- Dataset::ada_ivf_step_inline runs maintenance inside an append call. Maintenance is Layer 3; appends are Layer 2. The two should never share a thread.
- 1.5 threshold, 10/cell density gate, 10_000_000 max_n are hardcoded in Layer 2. These are Layer 3 knobs.
- "Use a feature branch" is documented as the resolution for concurrent appends + rebuild, but Dataset::branch() isn't implemented. The protocol provides the MECHANISM (refs are by-name pointers) but the SDK doesn't expose it as a verb, so apps can't actually do this.
The "mechanism vs policy" rule
Every feature should answer ONE of these two questions, never both:
| Mechanism (Layer 1 + 2) | Policy (Layer 3 + 4) |
|---|---|
| "What CAN happen?" | "What SHOULD happen now?" |
| "How is X represented?" | "When is X needed?" |
| "Given inputs, produce outputs." | "Given a goal, choose inputs." |
| Stateless, deterministic. | Stateful, context-dependent. |
| Reusable across deployments. | Specific to a deployment. |
Auto-rebuild fails this test cleanly: it answers "WHEN to rebuild"
(policy), not "HOW to rebuild" (mechanism). The mechanism
(ada-ivf-step CLI verb) is correctly placed. The decision to fire
it should never have lived in the SDK.
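The split in the table can be sketched in a few lines of Rust. Everything here is illustrative: the max/mean ratio is a stand-in for whatever imbalance metric the SDK really computes, and OperatorPolicy is a hypothetical name for state that would live in a cron job or operator tool, not in the SDK.

```rust
/// Mechanism (Layer 2): a pure function of its inputs. It reports a number
/// and does nothing else. The max/mean ratio is an assumed stand-in metric.
pub fn imbalance_score(cell_counts: &[u64]) -> f64 {
    let total: u64 = cell_counts.iter().sum();
    if total == 0 {
        return 1.0; // empty or all-zero input: treat as perfectly balanced
    }
    let max = cell_counts.iter().copied().max().unwrap_or(0) as f64;
    let mean = total as f64 / cell_counts.len() as f64;
    max / mean
}

/// Policy (Layer 3): deployment-specific state, owned by the operator.
pub struct OperatorPolicy {
    pub rebuild_threshold: f64,
}

impl OperatorPolicy {
    /// Answers "what SHOULD happen now?" from a signal the mechanism produced.
    pub fn should_rebuild(&self, score: f64) -> bool {
        score > self.rebuild_threshold
    }
}
```

Two deployments can hold different OperatorPolicy values against the same imbalance_score function, which is the reuse property the right-hand column of the table cannot offer.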
Concrete scope: what DreamDB does
Layer 1 — Protocol (the spec)
MUST define:
- Object kinds: Genesis, Manifest, Ref, Track, SpatialIndex, VectorCompressor, SpatialBucket, Fragment, ItemManifest, VectorStorage, ScalarBucket, IndexPage, GraphIndex, GraphPage
- For each, the canonical CBOR shape and the content-hash rule
- The Ref → Manifest → Track → Item resolution chain
- The Manifest parents DAG (time-travel + collaboration semantics)
- Lineage rules (which Objects' hashes appear in which others' headers)
- The conformance test corpus
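To make the resolution chain concrete, here is a toy in-memory model. It is a sketch under loud assumptions: a u64 from the standard library's DefaultHasher stands in for the real content hash, the Object enum covers only three of the kinds listed above, and a Manifest is reduced to one Track pointer. The canonical CBOR shapes are what the spec defines and are not modelled here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the spec's content hash (assumption: the real one is a multihash).
pub type ObjHash = u64;

#[derive(Hash, Clone, Debug, PartialEq)]
pub enum Object {
    Manifest { parents: Vec<ObjHash>, track: ObjHash },
    Track { items: Vec<ObjHash> },
    Item(Vec<u8>),
}

#[derive(Default)]
pub struct Store {
    objects: HashMap<ObjHash, Object>, // immutable, keyed by content hash
    refs: HashMap<String, ObjHash>,    // mutable by-name pointers
}

impl Store {
    /// Stores an object under the hash of its content and returns that hash.
    pub fn put(&mut self, obj: Object) -> ObjHash {
        let mut hasher = DefaultHasher::new();
        obj.hash(&mut hasher);
        let key = hasher.finish();
        self.objects.insert(key, obj);
        key
    }

    /// Points a named ref at a Manifest hash.
    pub fn set_ref(&mut self, name: &str, manifest: ObjHash) {
        self.refs.insert(name.to_string(), manifest);
    }

    /// Walks Ref -> Manifest -> Track -> Item and returns the item payloads.
    pub fn resolve(&self, ref_name: &str) -> Option<Vec<Vec<u8>>> {
        let head = self.refs.get(ref_name)?;
        let Object::Manifest { track, .. } = self.objects.get(head)? else {
            return None;
        };
        let Object::Track { items } = self.objects.get(track)? else {
            return None;
        };
        items
            .iter()
            .map(|h| match self.objects.get(h) {
                Some(Object::Item(bytes)) => Some(bytes.clone()),
                _ => None, // dangling or wrong-kind pointer fails the whole lookup
            })
            .collect()
    }
}
```

Only refs is mutable; putting the same content twice yields the same key, which is the property that lets everything below the Ref be cached and shared.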
MUST NOT touch:
- When to publish a new Manifest (policy)
- How often to GC (policy)
- Bucket size, batch size, k value (policy)
- Auth, encryption, multi-tenant isolation (operator or app)
- Query semantics ABOVE the dispatch layer ("most relevant" = cosine vs euclidean vs hybrid = policy)
Layer 2 — SDK / reference implementation
MUST provide verbs:
- Dataset::create/open/append_many/iter/query
- Connector::get/put/list_prefix/delete/head
- Session for cached lookups
- dreamdb-cli: rebuild-ivf, publish-rabitq, ada-ivf-step (split + merge), ada-ivf-status, gc, branch, merge, inspect
MUST emit signals — not act on them:
- Imbalance score after each append (return as part of AppendResult or write to Manifest's dreamdb.recommendations registry)
- GC candidate count (ada-ivf-status-style verbs report; don't act)
- Per-cell record density (for operator's k-target calculation)
- Bucket fragmentation level
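A minimal sketch of "emit, don't act" for the per-cell density signal. The field names and the toy append_many below are hypothetical; the text only says AppendResult should carry signals, not what its shape is.

```rust
/// Hypothetical result shape: what was written, plus signals for the operator.
#[derive(Debug, Clone, PartialEq)]
pub struct AppendResult {
    pub records_written: usize,
    /// Mean records per cell after the append (input to a k-target calculation).
    pub mean_cell_density: f64,
    /// Size of the sparsest cell after the append.
    pub min_cell_density: u64,
}

/// Toy append: `batch_cells[i]` is the cell that record i lands in.
/// The function updates counts and reports; it never triggers maintenance.
/// Panics if a cell index is out of range, which is acceptable for a sketch.
pub fn append_many(cell_counts: &mut [u64], batch_cells: &[usize]) -> AppendResult {
    for &cell in batch_cells {
        cell_counts[cell] += 1;
    }
    let total: u64 = cell_counts.iter().sum();
    let mean_cell_density = if cell_counts.is_empty() {
        0.0
    } else {
        total as f64 / cell_counts.len() as f64
    };
    AppendResult {
        records_written: batch_cells.len(),
        mean_cell_density,
        min_cell_density: cell_counts.iter().copied().min().unwrap_or(0),
    }
}
```

The caller is free to log the result, export it as a metric, or ignore it. Nothing in the return path can stall the append.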
MUST NOT do:
- Schedule its own work (no daemons, no inline rebuilds, no inline GC)
- Carry user policy state (no auto_rebuild=True schema flags)
- Make decisions on the operator's behalf (no "if imbalance > 1.5 then rebuild"; instead: "imbalance is 1.5, here's the signal")
Layer 3 — Operator tools
Provides:
- Cron entry / k8s CronJob / GitHub Actions workflow that calls dreamdb-cli verbs on schedule
- Monitoring integration: scrape ada-ivf-status output, emit Prometheus metrics, page when threshold crossed
- Retention policy: how many Manifests to keep, how aggressive to GC
- Capacity policy: when to rebuild-ivf vs ada-ivf-step
- Multi-region replication: which buckets to replicate, on what cadence
Out of scope for DreamDB: these are off-the-shelf tools (k8s, Prom, Argo, etc.). DreamDB just needs to BE schedulable — every maintenance operation must be a single shell command that exits with a clear status code. The CLI is the API to this layer.
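One way to read "exits with a clear status code" is a small, fixed mapping from outcome to exit status. The sketch below is an assumption, not the CLI's actual contract: the Outcome variants and the specific codes (0, 1, 10) are invented for illustration. std::process::ExitCode is the standard-library type a Rust main can return.

```rust
use std::process::ExitCode;

/// Outcome of one maintenance verb. Variant names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    /// The verb ran and found nothing to do.
    Clean,
    /// The verb ran and reports that maintenance is recommended.
    ActionRecommended,
    /// The verb could not complete.
    Failed(String),
}

/// Assumed convention: 0 = clean, 1 = error, 10 = action recommended.
/// A distinct non-error code lets a scheduler branch without parsing output.
pub fn status_byte(outcome: &Outcome) -> u8 {
    match outcome {
        Outcome::Clean => 0,
        Outcome::Failed(_) => 1,
        Outcome::ActionRecommended => 10,
    }
}

/// What a CLI `main` would return so the scheduler sees the status.
pub fn to_exit_code(outcome: &Outcome) -> ExitCode {
    if let Outcome::Failed(reason) = outcome {
        eprintln!("error: {reason}");
    }
    ExitCode::from(status_byte(outcome))
}
```

With a mapping like this, the "conditional ada-ivf-step" decision stays in the scheduler: it runs the status verb, inspects the exit status, and chooses whether to run the next command.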
Layer 4 — App
Provides:
- The user-meaningful abstractions (Workspace, Library, Project, Stream)
- UI / API / SDK that callers actually integrate with
- Auth, multi-tenancy (subject filtering on top of a shared DreamDB bucket; the app enforces "user X can only see Track Y")
- Quotas, rate limits, billing
- The "Slack-style real-time collaboration" UX, with the app coordinating writes (e.g. routing one user's writes to user-X-branch, resolving merges with semantic understanding the protocol can't have)
Out of scope for DreamDB: DreamDB provides immutable storage primitives. Whether your app uses them to build a vector DB, a time-series store, a media library, or a memory layer for an AI agent is the app's call.
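For illustration only, the "user X can only see Track Y" rule might look like this on the app side. Hit and TrackAcl are hypothetical app-level types; nothing like them exists in DreamDB, which is the point.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical query hit as the app sees it after calling into the SDK.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub track: String,
    pub item_id: u64,
}

/// App-owned access list: user -> set of Track names the user may read.
#[derive(Default)]
pub struct TrackAcl {
    grants: HashMap<String, HashSet<String>>,
}

impl TrackAcl {
    pub fn grant(&mut self, user: &str, track: &str) {
        self.grants
            .entry(user.to_string())
            .or_default()
            .insert(track.to_string());
    }

    /// Drops every hit from a Track the user has not been granted.
    /// Unknown users see nothing.
    pub fn visible(&self, user: &str, hits: Vec<Hit>) -> Vec<Hit> {
        match self.grants.get(user) {
            Some(tracks) => hits
                .into_iter()
                .filter(|h| tracks.contains(&h.track))
                .collect(),
            None => Vec::new(),
        }
    }
}
```

The filter runs after the storage layer has answered, so the shared bucket never needs to know who is asking.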
What this means for the current code
Should be removed from Layer 2 (the SDK)
| Currently here | Move to | Why |
|---|---|---|
| Schema.auto_rebuild, auto_rebuild_max_n, auto_rebuild_threshold | DELETE (operator decides) | Policy in protocol cloth. The operator's cron decides when to rebuild. |
| Dataset::ada_ivf_step_inline | DELETE | Layer 2 should never schedule its own work. |
| Density-gate hardcode (MIN_DENSITY_PER_CELL: u64 = 10) | DELETE with above | Same. |
| Default threshold 1.5 | DELETE with above | Operator's threshold. |
| Hardcoded max_n = 10_000_000 | DELETE with above | Operator's cap. |
Removing these reverts Dataset::append_many to a pure-mechanism call
that publishes one Manifest per batch with no side-channel
maintenance work. The throughput collapse from auto_rebuild=True
(280/s → 52/s) vanishes — it was self-inflicted.
Should be added to Layer 2 (currently missing)
| What | Why |
|---|---|
| Dataset::branch(name: &str) | Mechanism for the documented "feature branch" pattern. One PUT to <bucket>/refs/<name>. |
| Dataset::merge(other: &Ref, strategy: MergeStrategy) | Mechanism for combining branches. Strategy: refuse-on-conflict (default), fast-forward-only, ours, theirs. |
| dreamdb-cli gc --keep-manifests=N --keep-since=DURATION | Mark-and-sweep GC verb. Currently a 175-line Python script. |
| dreamdb-cli ada-ivf-step with merge support | Merge underpopulated cells. find_underpopulated_partitions exists in dreamdb-protocol/src/ada_ivf.rs; never used. Required to bound k growth (flaw §1). |
| Paged-track support in ada-ivf-step | Required to maintain indexes at 10K+ cells (flaw §6). |
| dreamdb-cli ada-ivf-status reading the current Track, not list-prefix | Stop lying about imbalance (flaw §7). |
| Schema-migration verb: dreamdb-cli schema-update <ref> <new-cbor> | Change a Schema's flags without re-ingesting. |
| Tombstone primitive in protocol + Dataset::delete verb | GDPR/correction story (flaw §9). Spec-level. |
| Chain-aware lineage: SI carries parents, bucket lineage check walks chain | The single highest-leverage fix. Unlocks flaws §2, §5, §6, §10. Spec-level. |
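The branch and merge rows can be sketched against a toy ref store. Several things here are assumptions rather than the SDK's behaviour: a counter stands in for content hashing, a Manifest is reduced to a map of Track name to hash, and a "conflict" is defined as the same Track name pointing at different hashes on the two sides. Only the four strategy names come from the table.

```rust
use std::collections::{BTreeMap, HashMap};

pub type Hash = u64; // stand-in for a content hash

#[derive(Clone, Debug, PartialEq)]
pub struct Manifest {
    pub parents: Vec<Hash>,
    pub tracks: BTreeMap<String, Hash>, // Track name -> Track object hash
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum MergeStrategy { RefuseOnConflict, FastForwardOnly, Ours, Theirs }

#[derive(Default)]
pub struct Repo {
    pub manifests: HashMap<Hash, Manifest>,
    pub refs: HashMap<String, Hash>,
    next_id: Hash,
}

impl Repo {
    /// Stores a Manifest; the counter replaces real content hashing.
    pub fn put(&mut self, m: Manifest) -> Hash {
        self.next_id += 1;
        self.manifests.insert(self.next_id, m);
        self.next_id
    }

    /// Branching is a single ref write: `name` now points where `from` points.
    pub fn branch(&mut self, from: &str, name: &str) -> Option<Hash> {
        let head = *self.refs.get(from)?;
        self.refs.insert(name.to_string(), head);
        Some(head)
    }

    fn is_ancestor(&self, ancestor: Hash, of: Hash) -> bool {
        let mut stack = vec![of];
        while let Some(h) = stack.pop() {
            if h == ancestor { return true; }
            if let Some(m) = self.manifests.get(&h) { stack.extend(&m.parents); }
        }
        false
    }

    /// Merges `other` into `into`. Assumes both refs point at stored Manifests.
    pub fn merge(&mut self, into: &str, other: &str, strategy: MergeStrategy) -> Result<Hash, String> {
        let ours = *self.refs.get(into).ok_or("unknown target ref")?;
        let theirs = *self.refs.get(other).ok_or("unknown source ref")?;
        if self.is_ancestor(theirs, ours) {
            return Ok(ours); // nothing new to merge
        }
        if self.is_ancestor(ours, theirs) {
            self.refs.insert(into.to_string(), theirs); // fast-forward: move the ref
            return Ok(theirs);
        }
        if strategy == MergeStrategy::FastForwardOnly {
            return Err("histories diverged; fast-forward not possible".to_string());
        }
        let mut tracks = self.manifests[&ours].tracks.clone();
        for (name, hash) in self.manifests[&theirs].tracks.clone() {
            let existing = tracks.get(&name).copied();
            match existing {
                Some(e) if e != hash => match strategy {
                    MergeStrategy::RefuseOnConflict => return Err(format!("conflict on track {name}")),
                    MergeStrategy::Theirs => { tracks.insert(name, hash); }
                    _ => {} // Ours: keep our side (FastForwardOnly returned earlier)
                },
                _ => { tracks.insert(name, hash); }
            }
        }
        // A true merge publishes a new Manifest with two parents.
        let merged = self.put(Manifest { parents: vec![ours, theirs], tracks });
        self.refs.insert(into.to_string(), merged);
        Ok(merged)
    }
}
```

Note that no strategy picks itself: the caller passes one in, so the choice between refusing and overwriting stays with the app or operator.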
Should be added to Layer 3 (currently missing)
These are the policy/scheduling pieces that aren't DreamDB's job but need EXAMPLES so users know how to set them up:
| What | Where |
|---|---|
| Example k8s CronJob calling ada-ivf-status + conditional ada-ivf-step | dreamdb-cli/examples/ada-ivf-cron.yaml |
| Example k8s CronJob calling gc --keep-since=7d daily | dreamdb-cli/examples/gc-cron.yaml |
| Example Prometheus exporter scraping CLI output | dreamdb-cli/examples/prom-exporter.sh |
| Example Argo workflow for full rebuild + verify | dreamdb-cli/examples/rebuild-workflow.yaml |
The k8s YAML we already wrote (ada-ivf-step.yaml) is one of these.
Notice it's an EXAMPLE in dreamdb-cli/examples/, not a verb. That's
the right placement.
Should be added to Layer 4 (out of scope for us, but worth naming)
Users who build apps on DreamDB will need these. DreamDB shouldn't provide them; it should DOCUMENT that they're missing so app builders don't expect them from us:
- User identity / auth
- Per-user / per-team rate limits
- Quota enforcement
- Multi-tenant isolation
- Real-time pub/sub for "new appends arrived"
- Search-result ranking that uses domain knowledge (e.g. recency boosts, category filtering with semantic meaning)
- A web UI / mobile UI / API gateway
The current browse.html demo is a Layer 4 app for the imagenet-100
demo. It belongs in an examples/ directory, not in the protocol or
SDK. (It currently is in dreamdb-dataset-python/examples/web/, which
is correct.)
Migration plan — how to actually do this
Phase 1 (1-2 days): remove the wrong things
- Delete Schema.auto_rebuild and its CBOR encode/decode paths.
- Delete Dataset::ada_ivf_step_inline from append_many.
- Delete the density gate, threshold default, max_n constant.
- Run all tests. The auto_rebuild_* integration tests get deleted.
- Update Python bindings (remove three kwargs from add_embedding).
- Memory updates: mark project_auto_rebuild.md as deprecated; pin a new memory explaining why.
After this, Dataset::append_many is pure-mechanism. Ingest throughput
returns to ~500-2000/s (the IvfCosine.hash_vector + merge-on-write
cost minus the rebuild stalls).
Phase 2 (1 week): add the missing mechanisms
- Dataset::branch(name) + Dataset::merge(other, strategy) in dreamdb-dataset/src/dataset.rs. Simple: one PUT each, plus the merge-strategy logic.
- dreamdb-cli gc verb (port the Python script to Rust; expose --keep-manifests=N --keep-since=DURATION).
- ada-ivf-step with merge support (the find_underpopulated_partitions primitive already exists in dreamdb-protocol/src/ada_ivf.rs; just wire it through).
- ada-ivf-status reading the current Track (~30 LOC change).
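As a rough picture of what the gc verb does, here is a mark-and-sweep pass over a toy object graph. It is not a port of the existing Python script. The edge map, the oldest-first history slice, and the choice to leave Manifest parent links out of the graph (so that retention can actually cut history) are all assumptions of this sketch, and --keep-since is not modelled.

```rust
use std::collections::{HashMap, HashSet};

pub type Hash = u64; // stand-in for a content hash

/// Toy object graph: each object lists the hashes it references.
/// Assumption: Manifest -> parent links are NOT included, otherwise every
/// old Manifest would stay reachable forever.
pub struct ObjectStore {
    pub edges: HashMap<Hash, Vec<Hash>>,
}

/// Returns the objects a sweep would delete, sorted for stable output.
/// `manifest_history` is oldest-first; the newest `keep_manifests` are roots.
pub fn gc_candidates(
    store: &ObjectStore,
    manifest_history: &[Hash],
    keep_manifests: usize,
) -> Vec<Hash> {
    // Mark: walk everything reachable from the retained Manifests.
    let mut live: HashSet<Hash> = HashSet::new();
    let mut stack: Vec<Hash> = manifest_history
        .iter()
        .rev()
        .take(keep_manifests)
        .copied()
        .collect();
    while let Some(h) = stack.pop() {
        if live.insert(h) {
            if let Some(children) = store.edges.get(&h) {
                stack.extend(children);
            }
        }
    }
    // Sweep: anything stored but never marked is a candidate.
    let mut dead: Vec<Hash> = store
        .edges
        .keys()
        .copied()
        .filter(|h| !live.contains(h))
        .collect();
    dead.sort_unstable();
    dead
}
```

The retention number arrives as an argument. The verb has no default opinion about how much history to keep, which keeps it on the mechanism side of the boundary.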
Phase 3 (2-3 weeks): the spec changes
- SI Object gets parents: Vec<Multihash>. Bucket lineage check walks the chain. spec/0007 amendment.
- Tombstones primitive: define the CBOR shape, query semantics, and GC interaction. spec/0009 amendment? or a new spec/0020.
- Schema-migration verb: define what it can change without re-ingesting.
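The first item, a lineage check that walks the chain, might look roughly like this. The sketch reads "SI" as the SpatialIndex Object, uses u64 as a stand-in for Multihash, and invents the lookup map and function name. Only the parents: Vec<Multihash> field comes from the text.

```rust
use std::collections::{HashMap, HashSet};

pub type Multihash = u64; // stand-in; the real type is a multihash

/// SpatialIndex reduced to the one field this check needs.
pub struct SpatialIndex {
    pub parents: Vec<Multihash>,
}

/// Returns true if a bucket built against `bucket_si` is valid under the
/// `current` index, meaning `bucket_si` is `current` itself or one of its
/// ancestors. A non-chain-aware check would accept only the exact match.
pub fn lineage_ok(
    indexes: &HashMap<Multihash, SpatialIndex>,
    current: Multihash,
    bucket_si: Multihash,
) -> bool {
    let mut seen: HashSet<Multihash> = HashSet::new();
    let mut stack = vec![current];
    while let Some(h) = stack.pop() {
        if h == bucket_si {
            return true;
        }
        if !seen.insert(h) {
            continue; // already visited via another parent
        }
        if let Some(si) = indexes.get(&h) {
            stack.extend(&si.parents);
        }
    }
    false
}
```

Because parents is a Vec, the same walk handles an index produced by merging two lineages, not just a linear chain.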
Phase 4 (ongoing): operator examples
Build out the dreamdb-cli/examples/ directory with the cron / k8s /
Argo recipes. Each one is a 50-200 line file with a comment block
explaining the policy decision it implements. These ARE the
documentation for "how do I run DreamDB in production."
The hardest part
Phase 1 is technically trivial (delete code) and emotionally hard. We
just shipped auto_rebuild=True to address a real user concern
("appends should self-heal"). Deleting it admits that the
implementation was the wrong shape — that the real concern was a
Layer 3 scheduling problem, not a Layer 2 SDK problem.
The honest framing for the doc: "we don't make automation worse for
small datasets; we make automation possible for ALL datasets by
moving it to the right layer." Small users get a k8s CronJob template
instead of an SDK flag. The CronJob template is in our repo. They run
kubectl apply. They get the same outcome, on the right side of the
layer boundary.
How to use this doc
When evaluating a future DreamDB feature, ask:
- Does it answer "what CAN happen" or "what SHOULD happen now"?
- If "what should happen now" — it's the operator's or app's job, not ours.
- If "what can happen" — fine, but is the SDK the right place, or does it belong in the protocol spec?
- Does it carry state that varies across deployments? If yes, it's policy (Layer 3 or 4), not mechanism.
- Does it schedule its own work? If yes, you're building a daemon inside the SDK. Don't.
The single sentence: DreamDB provides mechanisms and signals; layers above DreamDB decide what to do with them.