DreamDB v0.2.0 (bec026)

DreamDB Scope Boundaries — what's protocol, what's app

2026-05-15. Companion to design/0002-known-flaws-retrospective.md.

Status as of 2026-05-18: the four-phase migration described below is largely complete. Phase 1 (delete inline auto_rebuild and friends) shipped 2026-05-15. Phase 2 (operator mechanisms: ada-ivf-step --merge, Dataset::branch + merge, dreamdb-cli gc) shipped between 2026-05-15 and 2026-05-18 (B2 + B3 + B4 in the 10B push). Phase 3 (spec changes) is partially done: spec/0008 §5.3 documents multi-parent merge framings; full promotion of design/0007-sharded-ingest.md to a numbered spec is the remaining piece. Phase 4 (operator examples, i.e. k8s YAMLs) is partial: an ada-ivf-step example exists; the sharded-ingest YAML is pending.

The flaws retrospective identified that almost every DreamDB problem traces back to one of three architectural tensions, all of which boil down to the same root: we keep pulling operator/app concerns into the protocol layer. This document draws the boundary explicitly, with the lens "DreamDB is to vector databases what git is to version control — a content-addressed plumbing layer, not a user-facing application."

The goal isn't to make DreamDB smaller. It's to make the layers above DreamDB possible. Right now dreamdb-dataset carries policy (auto_rebuild=True, threshold defaults, density gates) that should live in the app calling it. That coupling makes both layers worse: the SDK is full of half-built scheduling, and apps can't build their OWN scheduling without fighting the SDK's.

The four layers

┌─────────────────────────────────────────────────────────────────┐
│  Layer 4 — App                                                  │
│  • UI / web service / ingestion pipeline                        │
│  • Decides WHEN to ingest, WHAT to ingest                       │
│  • Owns retry / backoff / batching strategy                     │
│  • Defines user-meaningful concepts (Space, Library, Workspace) │
│  • Renders results, handles auth, owns the user model           │
└────────────────────────┬────────────────────────────────────────┘
                         │  uses SDK verbs + reads metrics
┌────────────────────────▼────────────────────────────────────────┐
│  Layer 3 — Operator                                             │
│  • Cron / k8s CronJob / GitHub Actions                          │
│  • Schedules maintenance: rebuild-ivf, ada-ivf-step, GC          │
│  • Owns thresholds, retention policy, alerting                   │
│  • Bridges policy decisions to SDK mechanisms                    │
└────────────────────────┬────────────────────────────────────────┘
                         │  invokes dreamdb-cli + monitors signals
┌────────────────────────▼────────────────────────────────────────┐
│  Layer 2 — SDK / reference implementation                       │
│  • dreamdb-dataset, dreamdb-protocol, dreamdb-cli, dreamdb-connector│
│  • Implements the verbs (Open, Append, Get, Query, Stream)       │
│  • Provides mechanisms but does NOT enforce policy               │
│  • Emits signals (imbalance score, GC candidates) for upper      │
│    layers to act on                                              │
└────────────────────────┬────────────────────────────────────────┘
                         │  speaks the protocol over HTTP
┌────────────────────────▼────────────────────────────────────────┐
│  Layer 1 — Protocol (spec/)                                     │
│  • Object types, CBOR shapes, content hashes                    │
│  • Address grammar, Manifest/Ref/Track DAG, lineage rules        │
│  • Append semantics (CAS), read consistency model                │
│  • Algorithm self-description (an SI Object describes itself)    │
│  • Conformance test discipline                                   │
└─────────────────────────────────────────────────────────────────┘
                         │  HTTP/S3
                  ┌──────▼──────┐
                  │ Object Store│  (MinIO, S3, GCS, Azure Blob)
                  └─────────────┘

Where things have been mis-placed:

  • Schema.auto_rebuild=True lives in Layer 2 but it's a Layer 3 policy decision (when to trigger maintenance). It should be deleted from Layer 2 entirely.
  • Dataset::ada_ivf_step_inline runs maintenance inside an append call. Maintenance is Layer 3; appends are Layer 2. The two should never share a thread.
  • The 1.5 imbalance threshold, the 10-records-per-cell density gate, and the 10_000_000 max_n cap are hardcoded in Layer 2. These are Layer 3 knobs.
  • "Use a feature branch" is documented as the resolution for concurrent appends + rebuild — but Dataset::branch() isn't implemented. The protocol provides the MECHANISM (refs are by-name pointers) but the SDK doesn't expose it as a verb, so apps can't actually do this.
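
A minimal sketch of what such a branch verb amounts to, assuming a ref is nothing more than a named pointer holding a Manifest hash. The in-memory `store` dict, the default source ref `"main"`, and the refuse-if-exists behaviour are illustrative assumptions, not the SDK's API; only the `refs/<name>` key layout comes from this document.

```python
def branch(store: dict, name: str, from_ref: str = "main") -> str:
    """Create refs/<name> pointing at the Manifest that refs/<from_ref> points at.

    `store` stands in for the object store (key -> bytes). The default ref
    name "main" is an assumption made for this sketch.
    """
    src_key = f"refs/{from_ref}"
    dst_key = f"refs/{name}"
    if src_key not in store:
        raise KeyError(f"source ref {from_ref!r} does not exist")
    if dst_key in store:
        raise FileExistsError(f"ref {name!r} already exists")
    # One write: the new ref is a by-name pointer to the same Manifest hash.
    store[dst_key] = store[src_key]
    return dst_key


store = {"refs/main": b"manifest-hash-abc"}
branch(store, "rebuild-2026-05")
```

No Manifest or data object is copied: both refs name the same immutable Manifest, which is why the operation can be a single write.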

The "mechanism vs policy" rule

Every feature should answer ONE of these two questions, never both:

| Mechanism (Layer 1 + 2)           | Policy (Layer 3 + 4)           |
|-----------------------------------|--------------------------------|
| "What CAN happen?"                | "What SHOULD happen now?"      |
| "How is X represented?"           | "When is X needed?"            |
| "Given inputs, produce outputs."  | "Given a goal, choose inputs." |
| Stateless, deterministic.         | Stateful, context-dependent.   |
| Reusable across deployments.      | Specific to a deployment.      |

Auto-rebuild fails this test cleanly: it answers "WHEN to rebuild" (policy), not "HOW to rebuild" (mechanism). The mechanism (ada-ivf-step CLI verb) is correctly placed. The decision to fire it should never have lived in the SDK.
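
The split can be sketched in a few lines. This is an illustration under assumptions, not the SDK's types: `AppendResult`, `RebuildPolicy`, `should_rebuild`, and their field names are hypothetical. The point is only that the signal comes out of the mechanism layer while the threshold is supplied by whoever runs the deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppendResult:
    """Mechanism output: what happened, plus a signal. Field names are illustrative."""
    manifest_hash: str
    imbalance: float


@dataclass(frozen=True)
class RebuildPolicy:
    """Policy input: owned by the operator, differs per deployment."""
    imbalance_threshold: float


def should_rebuild(result: AppendResult, policy: RebuildPolicy) -> bool:
    # The decision lives outside the SDK; the SDK only supplied result.imbalance.
    return result.imbalance > policy.imbalance_threshold


result = AppendResult(manifest_hash="m1", imbalance=1.7)
strict = RebuildPolicy(imbalance_threshold=1.5)
relaxed = RebuildPolicy(imbalance_threshold=3.0)
```

The same `result` yields a rebuild under `strict` and none under `relaxed`. Two deployments can disagree without either needing a different SDK build.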


Concrete scope: what DreamDB does

Layer 1 — Protocol (the spec)

MUST define:

  • Object kinds: Genesis, Manifest, Ref, Track, SpatialIndex, VectorCompressor, SpatialBucket, Fragment, ItemManifest, VectorStorage, ScalarBucket, IndexPage, GraphIndex, GraphPage
  • For each, the canonical CBOR shape and the content-hash rule
  • The Ref → Manifest → Track → Item resolution chain
  • The Manifest parents DAG (time-travel + collaboration semantics)
  • Lineage rules (which Objects' hashes appear in which others' headers)
  • The conformance test corpus
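
As a rough illustration of content addressing and the Ref → Manifest → Track → Item chain, the sketch below stores toy objects under the SHA-256 of their canonical bytes and walks the chain from a ref. Several things here are assumptions: JSON with sorted keys stands in for canonical CBOR, the field names `tracks` and `items` are invented for the example, and the real hash rule is whatever the spec defines.

```python
import hashlib
import json


def put(store: dict, obj: dict) -> str:
    """Store an object under the hash of its canonical bytes and return the hash."""
    # Stand-in for canonical CBOR: JSON with sorted keys and fixed separators.
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def get(store: dict, digest: str) -> dict:
    """Fetch an object and verify that its bytes match the requested hash."""
    data = store[digest]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("content hash mismatch")
    return json.loads(data)


def resolve_items(store: dict, refs: dict, ref_name: str) -> list:
    """Walk Ref -> Manifest -> Track -> Item and return the item objects."""
    manifest = get(store, refs[ref_name])
    items = []
    for track_hash in manifest["tracks"]:
        track = get(store, track_hash)
        items.extend(get(store, h) for h in track["items"])
    return items


store, refs = {}, {}
item = put(store, {"kind": "Item", "payload": "hello"})
track = put(store, {"kind": "Track", "items": [item]})
manifest = put(store, {"kind": "Manifest", "parents": [], "tracks": [track]})
refs["main"] = manifest
```

Only the ref is mutable. Everything below it is named by its own hash, so a reader can verify each hop independently.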

MUST NOT touch:

  • When to publish a new Manifest (policy)
  • How often to GC (policy)
  • Bucket size, batch size, k value (policy)
  • Auth, encryption, multi-tenant isolation (operator or app)
  • Query semantics ABOVE the dispatch layer ("most relevant" = cosine vs euclidean vs hybrid = policy)

Layer 2 — SDK / reference implementation

MUST provide verbs:

  • Dataset::create / open / append_many / iter / query
  • Connector::get / put / list_prefix / delete / head
  • Session for cached lookups
  • dreamdb-cli: rebuild-ivf, publish-rabitq, ada-ivf-step (split + merge), ada-ivf-status, gc, branch, merge, inspect

MUST emit signals — not act on them:

  • Imbalance score after each append (return as part of AppendResult or write to Manifest's dreamdb.recommendations registry)
  • GC candidate count (ada-ivf-status-style verbs report; don't act)
  • Per-cell record density (for operator's k-target calculation)
  • Bucket fragmentation level

MUST NOT do:

  • Schedule its own work (no daemons, no inline rebuilds, no inline GC)
  • Carry user policy state (no auto_rebuild=True schema flags)
  • Make decisions on the operator's behalf (no "if imbalance > 1.5 then rebuild" — instead: "imbalance is 1.5, here's the signal")

Layer 3 — Operator tools

Provides:

  • Cron entry / k8s CronJob / GitHub Actions workflow that calls dreamdb-cli verbs on schedule
  • Monitoring integration: scrape ada-ivf-status output, emit Prometheus metrics, page when threshold crossed
  • Retention policy: how many Manifests to keep, how aggressive to GC
  • Capacity policy: when to rebuild-ivf vs ada-ivf-step
  • Multi-region replication: which buckets to replicate, on what cadence

Out of scope for DreamDB: these are off-the-shelf tools (k8s, Prom, Argo, etc.). DreamDB just needs to BE schedulable — every maintenance operation must be a single shell command that exits with a clear status code. The CLI is the API to this layer.
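
One way an operator might wire this up is a small wrapper that a cron entry invokes. The sketch below is hedged on several points: it assumes `ada-ivf-status` prints JSON containing an `imbalance` field and that both verbs take the ref as a positional argument, neither of which this document specifies. The command runner is injectable so the policy logic can be exercised without the CLI installed.

```python
import json
import subprocess
from typing import Callable, List, Tuple

Runner = Callable[[List[str]], Tuple[int, str]]


def run_cli(argv: List[str]) -> Tuple[int, str]:
    """Run one command and return (exit code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def maintenance_tick(ref: str, threshold: float, run: Runner = run_cli) -> int:
    """One cron tick: read the signal, apply the operator's threshold, maybe act.

    Returns a process-style exit code: 0 on success (including "nothing to do"),
    non-zero if either command failed.
    """
    code, out = run(["dreamdb-cli", "ada-ivf-status", ref])
    if code != 0:
        return code
    # ASSUMPTION: status output is JSON with an "imbalance" field.
    imbalance = float(json.loads(out)["imbalance"])
    if imbalance <= threshold:
        return 0
    code, _ = run(["dreamdb-cli", "ada-ivf-step", ref])
    return code


# Usage with a fake runner that records which verbs were invoked.
calls = []


def fake_run(argv):
    calls.append(argv[1])
    if argv[1] == "ada-ivf-status":
        return 0, '{"imbalance": 2.0}'
    return 0, ""


exit_code = maintenance_tick("main", threshold=1.5, run=fake_run)
```

The threshold is an argument of the operator's script, not a constant in the SDK, and the script's own exit code is what the scheduler alerts on.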

Layer 4 — App

Provides:

  • The user-meaningful abstractions (Workspace, Library, Project, Stream)
  • UI / API / SDK that callers actually integrate with
  • Auth, multi-tenancy (subject filtering on top of a shared DreamDB bucket; the app enforces "user X can only see Track Y")
  • Quotas, rate limits, billing
  • The "Slack-style real-time collaboration" UX, with the app coordinating writes (e.g. routing one user's writes to user-X-branch, resolving merges with semantic understanding the protocol can't have)

Out of scope for DreamDB: DreamDB provides immutable storage primitives. Whether your app uses them to build a vector DB, a time-series store, a media library, or a memory layer for an AI agent is the app's call.


What this means for the current code

Should be removed from Layer 2 (the SDK)

| Currently here | Move to | Why |
|----------------|---------|-----|
| Schema.auto_rebuild, auto_rebuild_max_n, auto_rebuild_threshold | DELETE (operator decides) | Policy in protocol cloth. The operator's cron decides when to rebuild. |
| Dataset::ada_ivf_step_inline | DELETE | Layer 2 should never schedule its own work. |
| Density-gate hardcode (MIN_DENSITY_PER_CELL: u64 = 10) | DELETE with above | Same. |
| Default threshold 1.5 | DELETE with above | Operator's threshold. |
| Hardcoded max_n = 10_000_000 | DELETE with above | Operator's cap. |

Removing these reverts Dataset::append_many to a pure-mechanism call that publishes one Manifest per batch with no side-channel maintenance work. The throughput collapse from auto_rebuild=True (280/s → 52/s) vanishes — it was self-inflicted.

Should be added to Layer 2 (currently missing)

| What | Why |
|------|-----|
| Dataset::branch(name: &str) | Mechanism for the documented "feature branch" pattern. One PUT to <bucket>/refs/<name>. |
| Dataset::merge(other: &Ref, strategy: MergeStrategy) | Mechanism for combining branches. Strategy: refuse-on-conflict (default), fast-forward-only, ours, theirs. |
| dreamdb-cli gc --keep-manifests=N --keep-since=DURATION | Mark-and-sweep GC verb. Currently a 175-line Python script. |
| dreamdb-cli ada-ivf-step with merge support | Merge underpopulated cells. find_underpopulated_partitions exists in dreamdb-protocol/src/ada_ivf.rs; never used. Required to bound k growth (flaw §1). |
| Paged-track support in ada-ivf-step | Required to maintain indexes at 10K+ cells (flaw §6). |
| dreamdb-cli ada-ivf-status reading the current Track, not list-prefix | Stop lying about imbalance (flaw §7). |
| Schema-migration verb: dreamdb-cli schema-update <ref> <new-cbor> | Change a Schema's flags without re-ingesting. |
| Tombstone primitive in protocol + Dataset::delete verb | GDPR/correction story (flaw §9). Spec-level. |
| Chain-aware lineage: SI carries parents, bucket lineage check walks chain | The single highest-leverage fix. Unlocks flaws §2, §5, §6, §10. Spec-level. |
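
To make the four merge strategies concrete, here is a toy three-way merge over maps of track name to content hash. It is a sketch under stated assumptions: the document does not define what counts as a conflict, so this example treats "both sides changed the same track differently since the common base" as one, and it approximates fast-forward as "our side has not moved since base". `merge_tracks`, `MergeRefused`, and the `Tracks` model are hypothetical names.

```python
from enum import Enum
from typing import Dict


class MergeStrategy(Enum):
    REFUSE_ON_CONFLICT = "refuse-on-conflict"
    FAST_FORWARD_ONLY = "fast-forward-only"
    OURS = "ours"
    THEIRS = "theirs"


class MergeRefused(Exception):
    pass


Tracks = Dict[str, str]  # track name -> content hash (toy model)


def merge_tracks(base: Tracks, ours: Tracks, theirs: Tracks,
                 strategy: MergeStrategy = MergeStrategy.REFUSE_ON_CONFLICT) -> Tracks:
    if strategy is MergeStrategy.FAST_FORWARD_ONLY:
        # Fast-forward is only possible when our side has not moved since base.
        if ours != base:
            raise MergeRefused("branches have diverged; not a fast-forward")
        return dict(theirs)

    merged: Tracks = {}
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            chosen = o            # both sides agree
        elif o == b:
            chosen = t            # only theirs changed
        elif t == b:
            chosen = o            # only ours changed
        elif strategy is MergeStrategy.OURS:
            chosen = o            # conflict, resolved in our favour
        elif strategy is MergeStrategy.THEIRS:
            chosen = t            # conflict, resolved in their favour
        else:
            raise MergeRefused(f"conflicting changes to track {name!r}")
        if chosen is not None:    # None means the track was removed
            merged[name] = chosen
    return merged


base = {"a": "h1"}
ours = {"a": "h1", "b": "h2"}
theirs = {"a": "h3"}
clean = merge_tracks(base, ours, theirs)
```

In this example `ours` added track `b` and `theirs` rewrote track `a`, so the default strategy merges both changes without refusing. Anything that needs semantic judgement to resolve stays with the app, which is why refusing is the default.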

Should be added to Layer 3 (currently missing)

These are the policy/scheduling pieces that aren't DreamDB's job but need EXAMPLES so users know how to set them up:

| What | Where |
|------|-------|
| Example k8s CronJob calling ada-ivf-status + conditional ada-ivf-step | dreamdb-cli/examples/ada-ivf-cron.yaml |
| Example k8s CronJob calling gc --keep-since=7d daily | dreamdb-cli/examples/gc-cron.yaml |
| Example Prometheus exporter scraping CLI output | dreamdb-cli/examples/prom-exporter.sh |
| Example Argo workflow for full rebuild + verify | dreamdb-cli/examples/rebuild-workflow.yaml |

The k8s YAML we already wrote (ada-ivf-step.yaml) is one of these. Notice it's an EXAMPLE in dreamdb-cli/examples/, not a verb. That's the right placement.

Should be added to Layer 4 (out of scope for us, but worth naming)

Users who build apps on DreamDB will need these. DreamDB shouldn't provide them; it should DOCUMENT that they're missing so app builders don't expect them from us:

  • User identity / auth
  • Per-user / per-team rate limits
  • Quota enforcement
  • Multi-tenant isolation
  • Real-time pub/sub for "new appends arrived"
  • Search-result ranking that uses domain knowledge (e.g. recency boosts, category filtering with semantic meaning)
  • A web UI / mobile UI / API gateway

The current browse.html demo is a Layer 4 app for the imagenet-100 demo. It belongs in an examples/ directory, not in the protocol or SDK. (It currently is in dreamdb-dataset-python/examples/web/, which is correct.)


Migration plan — how to actually do this

Phase 1 (1-2 days): remove the wrong things

  1. Delete Schema.auto_rebuild and its CBOR encode/decode paths.
  2. Delete Dataset::ada_ivf_step_inline from append_many.
  3. Delete the density gate, threshold default, max_n constant.
  4. Run all tests. Delete the auto_rebuild_* integration tests.
  5. Update Python bindings (remove three kwargs from add_embedding).
  6. Memory updates: mark project_auto_rebuild.md as deprecated; pin a new memory explaining why.

After this, Dataset::append_many is pure-mechanism. Ingest throughput returns to ~500-2000/s (the IvfCosine.hash_vector + merge-on-write cost minus the rebuild stalls).

Phase 2 (1 week): add the missing mechanisms

  1. Dataset::branch(name) + Dataset::merge(other, strategy) in dreamdb-dataset/src/dataset.rs. Simple — one PUT each, plus the merge-strategy logic.
  2. dreamdb-cli gc verb (port the Python script to Rust; expose --keep-manifests=N --keep-since=DURATION).
  3. ada-ivf-step with merge support (the find_underpopulated_partitions primitive already exists in dreamdb-protocol/src/ada_ivf.rs; just wire it through).
  4. ada-ivf-status reading the current Track (~30 LOC change).
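
For item 2, the shape of a mark-and-sweep pass with the two retention flags can be sketched as follows. This is not a port of the existing script. It assumes the flags combine as a union (a Manifest survives if it is among the newest N or younger than the duration), and it deliberately does not follow Manifest parent links during marking, since doing so would keep all history alive. All function names and the toy object graph are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple


def retained_roots(manifests: List[Tuple[str, datetime]], keep_manifests: int,
                   keep_since: timedelta, now: datetime) -> Set[str]:
    """Pick the Manifests that the retention policy keeps alive.

    ASSUMPTION: the two flags combine as a union.
    """
    newest_first = sorted(manifests, key=lambda m: m[1], reverse=True)
    keep = {h for h, _ in newest_first[:keep_manifests]}
    keep |= {h for h, created in manifests if now - created <= keep_since}
    return keep


def mark(roots: Set[str], children: Dict[str, List[str]]) -> Set[str]:
    """Mark phase: everything reachable from the retained roots."""
    marked: Set[str] = set()
    stack = list(roots)
    while stack:
        h = stack.pop()
        if h in marked:
            continue
        marked.add(h)
        stack.extend(children.get(h, []))
    return marked


def sweep_candidates(all_objects: Set[str], marked: Set[str]) -> Set[str]:
    """Sweep phase: objects nothing retained points at."""
    return all_objects - marked


now = datetime(2026, 5, 18)
manifests = [("m1", datetime(2026, 5, 1)),
             ("m2", datetime(2026, 5, 10)),
             ("m3", datetime(2026, 5, 17))]
children = {"m1": ["t1"], "m2": ["t1", "t2"], "m3": ["t2", "t3"]}
all_objects = {"m1", "m2", "m3", "t1", "t2", "t3"}

roots = retained_roots(manifests, keep_manifests=1,
                       keep_since=timedelta(days=7), now=now)
garbage = sweep_candidates(all_objects, mark(roots, children))
```

Here only `m3` is retained, so `m1`, `m2`, and `t1` become sweep candidates while `t2` survives because `m3` still references it. Reporting the candidate set separately from deleting it matches the signal-first rule for Layer 2.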

Phase 3 (2-3 weeks): the spec changes

  1. SI Object gets parents: Vec<Multihash>. Bucket lineage check walks the chain. spec/0007 amendment.
  2. Tombstones primitive: define the CBOR shape, query semantics, and GC interaction. Open question: a spec/0009 amendment or a new spec/0020.
  3. Schema-migration verb: define what it can change without re-ingesting.
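
For item 1, a chain-walking lineage check could look roughly like the sketch below. It assumes a bucket records the hash of the SI it was built against and that the check should accept the bucket when that hash is the current SI or any ancestor reachable through `parents`; the actual acceptance rule belongs to the spec amendment. `lineage_ok` and the dict-based parent map are illustrative.

```python
from typing import Dict, List


def lineage_ok(bucket_si: str, current_si: str,
               parents: Dict[str, List[str]]) -> bool:
    """True if `bucket_si` is the current SI or one of its ancestors."""
    seen = set()
    stack = [current_si]
    while stack:
        h = stack.pop()
        if h == bucket_si:
            return True
        if h in seen:
            continue
        seen.add(h)
        # Follow every parent so multi-parent (merged) histories are covered.
        stack.extend(parents.get(h, []))
    return False


# Linear history: si1 <- si2 <- si3 (si3 is current).
parents = {"si3": ["si2"], "si2": ["si1"], "si1": []}
```

With this history, a bucket built against `si1` passes when the current SI is `si3`, while a bucket citing an unrelated SI fails. Without the chain walk, only an exact match with the current SI could pass.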

Phase 4 (ongoing): operator examples

Build out the dreamdb-cli/examples/ directory with the cron / k8s / Argo recipes. Each one is a 50-200 line file with a comment block explaining the policy decision it implements. These ARE the documentation for "how do I run DreamDB in production."


The hardest part

Phase 1 is technically trivial (delete code) and emotionally hard. We just shipped auto_rebuild=True to address a real user concern ("appends should self-heal"). Deleting it admits that the implementation was the wrong shape — that the real concern was a Layer 3 scheduling problem, not a Layer 2 SDK problem.

The honest framing for the doc: "we don't make automation worse for small datasets; we make automation possible for ALL datasets by moving it to the right layer." Small users get a k8s CronJob template instead of an SDK flag. The CronJob template is in our repo. They run kubectl apply. They get the same outcome, on the right side of the layer boundary.


How to use this doc

When evaluating a future DreamDB feature, ask:

  1. Does it answer "what CAN happen" or "what SHOULD happen now"?
  2. If "what should happen now" — it's the operator's or app's job, not ours.
  3. If "what can happen" — fine, but is the SDK the right place, or does it belong in the protocol spec?
  4. Does it carry state that varies across deployments? If yes, it's policy (Layer 3 or 4), not mechanism.
  5. Does it schedule its own work? If yes, you're building a daemon inside the SDK. Don't.

The single sentence: DreamDB provides mechanisms and signals; layers above DreamDB decide what to do with them.