DreamDB v0.2.0 bec026

Spec 0017 — Schema Evolution and Embedding Migration

Status: Draft (Phase 4 design).
Depends on: spec/0001, spec/0002, spec/0006, spec/0008, spec/0010, spec/0016.

Motivation: A 10B-item DreamDB deployment outlives any one embedding model. OpenAI's text-embedding-3 obsoleted text-embedding-ada-002 within 14 months; clip-ViT-B/32 has been the dominant image encoder for 3 years and will eventually be replaced. When the model upgrades, the operator faces a brutal choice today: keep the old corpus and accept degraded quality on new queries, or re-encode 10B items in one atomic operation that takes weeks. Both are wrong. spec/0017 defines the protocol-level primitives (multi-version modality registries, a Reencode verb, and a compatible_with hint) that make incremental, partial, resumable migration the default.


1. Purpose

The protocol's immutability and content-addressing make schema evolution structurally easy: a new modality is just a new modality, with its own Track Object, its own SpatialIndex, its own VectorCompressor. The hard part is migrating gracefully — keeping queries working through the transition, sharing storage where possible, and avoiding the "stop the world" rebuild.

By the end of this document the following are concrete:

  • Multi-version modalities in a single Manifest's registry — embedding.v1 and embedding.v2 register independently, share Items, and queries route to the right version per-call.
  • The Reencode verb: operator-driven bulk re-index that reads source Items, applies a transform (typically: run new model inference), writes target Items, and incrementally publishes progress. Resumable, idempotent.
  • The compatible_with registry hint: optional declaration that a new modality is approximately compatible with an old one — used by the query planner (spec/0015) to fall back gracefully during migration.
  • Versioned modality strings: a discipline for naming evolving modalities so old/new tracks coexist without ambiguity.
  • The migration manifest pattern: how a long-running re-encode produces a sequence of Layer Manifests rather than one giant atomic commit.

What stays defined elsewhere:

  • Per-modality storage layouts — spec/0007, spec/0010, spec/0013.
  • The Layer mechanism — spec/0008 §3.
  • Streaming updates / hot-shard — spec/0016.

What this document does NOT define:

  • Automatic model selection. Which model to migrate TO is operator-policy.
  • Transform correctness verification. That the new model's outputs are "right" is the operator's training-eval concern, not the protocol's.
  • Cross-modality lossy conversion. Converting a CLIP embedding to a BERT embedding is meaningless and out of scope.
  • Model deployment. How operators run inference at scale is implementation-defined; this spec defines only the DreamDB-side coordination.

2. Multi-version modalities

2.1 Modality versioning convention

A version-aware modality string carries an explicit version=<N> parameter:

embedding.f32.dim=768.bucketed.spatial-bits=18.version=1
embedding.f32.dim=1024.bucketed.spatial-bits=18.version=2

The version parameter is OPTIONAL. Without it, modalities are unversioned (effectively version=1 implicit). Two modalities with the same shape but different version are distinct modalities for all protocol purposes — different path slots, different Track Objects, different SpatialIndex Objects, different bucket headers.

This is intentional. Modality strings are content-addressing keys; two modalities are "the same" iff their strings are identical. The version parameter makes incompatibilities visible.
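As an illustration of the convention, the following is a minimal sketch of how an SDK might read the version parameter out of a modality string. It assumes, based only on the two examples above, that a modality string is a dot-separated list of tokens in which some tokens are key=value pairs; the names parse_modality, ParsedModality, and same_modality are hypothetical and not part of the protocol.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class ParsedModality(NamedTuple):
    raw: str                  # exact string; this alone is the protocol identity
    tokens: Tuple[str, ...]   # dot-separated tokens, in order
    params: Dict[str, str]    # key=value tokens, e.g. {"dim": "768"}
    version: Optional[int]    # None when no version=<N> token is present


def parse_modality(modality: str) -> ParsedModality:
    """Split a modality string on '.' and collect its key=value parameters.

    Assumption: tokens never contain a literal '.', as in the spec's examples.
    """
    tokens = tuple(modality.split("."))
    params = {}
    for token in tokens:
        if "=" in token:
            key, value = token.split("=", 1)
            params[key] = value
    version = int(params["version"]) if "version" in params else None
    return ParsedModality(modality, tokens, params, version)


def same_modality(a: str, b: str) -> bool:
    """Identity is plain string equality; parsing never affects it."""
    return a == b


v1 = parse_modality("embedding.f32.dim=768.bucketed.spatial-bits=18.version=1")
v2 = parse_modality("embedding.f32.dim=1024.bucketed.spatial-bits=18.version=2")
print(v1.version, v1.params["dim"])   # 1 768
print(v2.version, v2.params["dim"])   # 2 1024
print(same_modality(v1.raw, v2.raw))  # False
```

The parsed form is only a convenience for tooling. Comparison stays on the raw string, which matches the rule that two modalities are the same iff their strings are identical.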

2.2 Multiple versions in one Manifest

A Manifest registry MAY declare multiple versions of the same logical concept:

"registry": {
  "embedding.f32.dim=768.bucketed.spatial-bits=18.version=1": {
    "kind":          "continuous",
    "object_kind":   "spatial-bucket",
    "algorithm":     "dreamdb.imi-cosine",
    "spatial_index": [<old-SI-hash>],
    "track":         <old-Track-hash>,
  },
  "embedding.f32.dim=1024.bucketed.spatial-bits=18.version=2": {
    "kind":          "continuous",
    "object_kind":   "spatial-bucket",
    "algorithm":     "dreamdb.imi-cosine",
    "spatial_index": [<new-SI-hash>],
    "track":         <new-Track-hash>,
    "compatible_with": [
      { "modality": "embedding.f32.dim=768.bucketed.spatial-bits=18.version=1",
        "relationship": "supersedes",
        "transform_ref": <multihash | null>,    ;; OPTIONAL: how to convert v1 → v2
        "coverage": "complete" | "partial",     ;; whether v2 covers every v1 Item
      }
    ]
  }
}

Both versions are individually queryable; the second's compatible_with field declares its relationship to the first. The query planner (spec/0015) uses this to route hybrid queries across the migration boundary.

2.3 Coverage during migration

During a long-running migration, coverage = "partial" indicates that not every v1 Item has been re-encoded to v2 yet. The planner's behavior depends on the query:

  • Query against v2 explicitly: returns only v2 results. Coverage gap is visible to the application.
  • Query against v1 explicitly: returns v1 results (unchanged).
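  • Query against the logical concept (no explicit version) with version_preference = "all" (§4.1): planner queries BOTH versions and falls back to v1 for Items not in v2's coverage set. Returns a unified ranking. Under the default "latest", only v2 is queried and the coverage gap stays visible.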

The "logical concept" path requires a small extension to the query verb's track_selector — see §4.

3. The Reencode verb

A new verb. Reencode reads source Items from a source modality, applies a transform (operator-supplied function), writes target Items to a target modality, and publishes progress as Layer Manifests.

3.1 Verb signature

;; Reencode RPC body (CBOR)
{
  "source_modality":  "<modality-tag>",
  "target_modality":  "<modality-tag>",
  "transform_ref":    <multihash>,                   ;; opaque transform identifier
  "batch_size":       <unsigned int>,                ;; items per Layer Manifest
  "anchor_range":     [<lo: u64>, <hi: u64>] | null,  ;; optional bounded re-encode
  "resume_from":      <multihash | null>,             ;; if non-null, resume from this prior Reencode state
  "capability":       <bytes>,                        ;; spec/0012 token; requires "write" scope
}

The SDK implementation walks the source Track's Items in anchor order, applies the transform (out-of-band — DreamDB doesn't dictate how), and appends to the target Track. Progress is checkpointed every batch_size Items via a Layer Manifest.

3.2 What is transform_ref?

An opaque content hash. The protocol does NOT define what bytes it points to — different deployments use it differently:

  • Inference deployments: transform_ref points at a small CBOR Object describing model identity, weights hash, preprocessing config. The SDK uses this to look up the correct inference endpoint.
  • Pure-transform deployments (e.g., re-normalizing existing vectors): transform_ref points at a CBOR Object describing the math.
  • Test deployments: transform_ref is a no-op identifier; the SDK skips actual encoding.

DreamDB stores the hash as audit trail. The operator's external system resolves it.

3.3 Idempotency and resume

Reencode publishes intermediate progress as Layer Manifests, each one valid as a queryable state. A crash mid-Reencode leaves the system in a consistent partial state — the next invocation with resume_from: <last-published-Manifest-hash> picks up at the next batch.

The Layer Manifest's body carries a small reencode_state sub-Object:

"reencode_state": {
  "source_modality":  "<modality-tag>",
  "target_modality":  "<modality-tag>",
  "transform_ref":    <multihash>,
  "items_done":       <unsigned int>,
  "items_total":      <unsigned int>,
  "last_anchor":      <u64>,
  "checkpoint_at":    <u64>,                          ;; Unix ns
}

Resume reads this; verifies the source modality has not changed since the checkpoint (Manifest parent-chain walk); continues from last_anchor + 1.
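The checkpoint-and-resume behavior can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not the SDK's implementation: Tracks are modeled as plain dicts keyed by anchor, Layer Manifests as a list of ReencodeState records, and the transform as a local function. The names reencode, ReencodeState, toy_transform, and crash_at_anchor are hypothetical, and the parent-chain verification step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ReencodeState:
    """Subset of the reencode_state fields needed for resume."""
    items_done: int
    items_total: int
    last_anchor: int


def reencode(
    source: Dict[int, bytes],             # anchor -> source Item payload
    target: Dict[int, bytes],             # anchor -> re-encoded payload (mutated)
    transform: Callable[[bytes], bytes],  # stand-in for out-of-band inference
    batch_size: int,
    published: List[ReencodeState],       # stand-in for published Layer Manifests
    resume_from: Optional[ReencodeState] = None,
    crash_at_anchor: Optional[int] = None,  # test hook: simulate a crash
) -> None:
    """Walk source in anchor order, checkpointing once per completed batch."""
    floor = resume_from.last_anchor if resume_from else -1
    pending = sorted(a for a in source if a > floor)  # continue at last_anchor + 1
    done = resume_from.items_done if resume_from else 0
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for anchor in batch:
            if anchor == crash_at_anchor:
                raise RuntimeError("simulated crash mid-batch")
            # Writes are keyed by anchor, so replaying a half-written batch
            # after a crash overwrites the same entries with the same values.
            target[anchor] = transform(source[anchor])
        done += len(batch)
        published.append(ReencodeState(done, len(source), batch[-1]))


def toy_transform(payload: bytes) -> bytes:
    return payload[::-1]


source = {anchor: bytes([anchor, anchor + 1]) for anchor in range(10)}

# Uninterrupted run.
clean_target, clean_log = {}, []
reencode(source, clean_target, toy_transform, 4, clean_log)

# Crash in the second batch, then resume from the last published checkpoint.
crash_target, crash_log = {}, []
try:
    reencode(source, crash_target, toy_transform, 4, crash_log, crash_at_anchor=6)
except RuntimeError:
    pass
reencode(source, crash_target, toy_transform, 4, crash_log, resume_from=crash_log[-1])

print(crash_target == clean_target, crash_log == clean_log)
```

In this model the interrupted run converges to the same target contents and the same checkpoint sequence as the uninterrupted run, because every write is keyed by anchor and the resume point is the last fully checkpointed batch.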

3.4 Concurrency with live ingest

Reencode runs in parallel with live writers appending to the source Track. The migration's coverage view treats anything in the source's HotShard or appended after last_anchor as "not yet migrated"; new appends after checkpoint_at are flagged as backlog for the next Reencode pass.

For a continuously growing Track, full coverage = "all items written before the final Reencode pass." Operators typically run several passes:

  • Pass 1: covers the bulk corpus at time T1. Coverage at completion: items with anchor < T1.
  • Pass 2: covers the backlog (items added between T1 and T2). Faster, smaller.
  • Pass 3+: convergence as backlog shrinks.

Eventually the operator declares "migration complete" and updates the coverage field to "complete" (or removes the v1 modality from active registry).
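To see why a handful of passes usually suffices, consider a simplified model that is an assumption of this example, not something the spec defines: ingest and re-encode both run at constant rates. Each pass then takes backlog / reencode_rate hours, during which ingest_rate × that duration new Items arrive, so the backlog shrinks by the ratio ingest_rate / reencode_rate per pass. The function name and the numbers below are hypothetical.

```python
from typing import List, Tuple


def simulate_passes(
    initial_items: float,
    ingest_per_hour: float,
    reencode_per_hour: float,
    threshold_items: float,
    max_passes: int = 100,
) -> Tuple[List[Tuple[float, float]], float]:
    """Return ([(backlog_at_start, hours_taken), ...], remaining_backlog)."""
    if ingest_per_hour >= reencode_per_hour:
        raise ValueError("backlog never shrinks: ingest rate >= re-encode rate")
    backlog = float(initial_items)
    history = []
    while backlog > threshold_items and len(history) < max_passes:
        hours = backlog / reencode_per_hour
        history.append((backlog, hours))
        # Items that arrived while this pass was running form the next backlog.
        backlog = ingest_per_hour * hours
    return history, backlog


history, remaining = simulate_passes(
    initial_items=10_000_000_000,   # bulk corpus
    ingest_per_hour=1_000_000,      # hypothetical live ingest rate
    reencode_per_hour=100_000_000,  # hypothetical re-encode rate
    threshold_items=50_000,         # backlog small enough to declare completion
)
for number, (backlog, hours) in enumerate(history, start=1):
    print(f"pass {number}: backlog={backlog:.0f} items, duration={hours:.2f} h")
print(f"remaining backlog: {remaining:.0f} items")
```

With these made-up rates the backlog falls by a factor of 100 per pass, so three passes bring it under the threshold. If ingest ever matches or exceeds the re-encode rate, the model never converges, which is why the sketch rejects that case up front.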

3.5 Resource budgeting

Reencode is resource-intensive (one inference forward pass per Item × 10B Items can take days). The verb body MAY include budgets:

"budgets": {
  "max_items_per_hour":  <unsigned int>,
  "max_concurrent":      <unsigned int>,             ;; SDK-side parallelism
  "deadline_at":         <u64>,                       ;; Unix ns; SDK pauses if reached
}

The SDK pauses at deadline_at and writes a resumable checkpoint rather than failing mid-batch. Operators use this for off-peak migration windows.
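One way an SDK could honor these budgets is to evaluate them only at batch boundaries, so a pause always coincides with a freshly written checkpoint. The sketch below assumes that policy; the function names and the "pause" / "wait" / "run" return values are hypothetical, and max_concurrent is not modeled.

```python
NS_PER_HOUR = 3_600 * 10**9


def min_batch_interval_ns(batch_size: int, max_items_per_hour: int) -> int:
    """Shortest wall-clock span one batch may occupy without exceeding the rate."""
    return (batch_size * NS_PER_HOUR) // max_items_per_hour


def next_action(now_ns: int, batch_started_ns: int, batch_size: int, budgets: dict) -> str:
    """Decide what to do at a batch boundary: 'pause', 'wait', or 'run'."""
    deadline = budgets.get("deadline_at")
    if deadline is not None and now_ns >= deadline:
        return "pause"  # the previous batch is already checkpointed; stop cleanly
    rate = budgets.get("max_items_per_hour")
    if rate is not None:
        earliest_next = batch_started_ns + min_batch_interval_ns(batch_size, rate)
        if now_ns < earliest_next:
            return "wait"  # previous batch finished early; hold back to respect the rate
    return "run"


budgets = {"max_items_per_hour": 100_000_000, "deadline_at": 1_000 * 10**9}
interval = min_batch_interval_ns(1_000_000, budgets["max_items_per_hour"])
print(interval // 10**9)                                      # 36 (seconds per 1M-item batch)
print(next_action(10 * 10**9, 0, 1_000_000, budgets))         # wait
print(next_action(40 * 10**9, 0, 1_000_000, budgets))         # run
print(next_action(1_000 * 10**9, 0, 1_000_000, budgets))      # pause
```

Checking the deadline only between batches means a batch already in flight may finish slightly after deadline_at. An implementation that needs a hard stop would have to budget for the expected batch duration as well.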

4. Query planner extensions (spec/0015 amendment)

4.1 Logical track selector

The HybridQuery's track_selector (per spec/0015 §5.1) gains an OPTIONAL logical_concept field:

"track_selector": {
  "logical_concept":  "embedding",                  ;; the un-versioned shape
  "version_preference": "latest" | "all" | "<version-spec>",
}

logical_concept is a free-text label; version_preference controls planner behavior:

  • "latest": query the highest-version registered modality of this concept. v2 coverage gap NOT filled.
  • "all": query every version + merge results via the planner's hybrid fusion. v2 coverage gap automatically filled by v1.
  • "<version-spec>": query a specific version (e.g., "version=2").

The default is "latest" — applications get the new model's results unless they explicitly opt into migration-aware behavior. Migration tooling SHOULD set "all" during the transition.
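The three version_preference modes can be sketched as a selection over registry keys. How a free-text logical_concept is matched to modality strings is not specified here, so this example assumes it matches the first dot-separated token of the string. That matching rule and the names version_of and select_modalities are hypothetical.

```python
from typing import Iterable, List


def version_of(modality: str) -> int:
    """Read version=<N>; unversioned strings count as an implicit version 1 (§2.1)."""
    for token in modality.split("."):
        if token.startswith("version="):
            return int(token.split("=", 1)[1])
    return 1


def select_modalities(
    registry_keys: Iterable[str],
    logical_concept: str,
    version_preference: str = "latest",
) -> List[str]:
    """Return the modality strings a query should run against, oldest first."""
    candidates = sorted(
        (m for m in registry_keys if m.split(".", 1)[0] == logical_concept),
        key=version_of,
    )
    if not candidates or version_preference == "all":
        return candidates
    if version_preference == "latest":
        return [candidates[-1]]
    if version_preference.startswith("version="):
        wanted = int(version_preference.split("=", 1)[1])
        return [m for m in candidates if version_of(m) == wanted]
    raise ValueError(f"unsupported version_preference: {version_preference!r}")


V1 = "embedding.f32.dim=768.bucketed.spatial-bits=18.version=1"
V2 = "embedding.f32.dim=1024.bucketed.spatial-bits=18.version=2"
registry_keys = [V2, V1]

print(select_modalities(registry_keys, "embedding"))               # latest -> v2 only
print(select_modalities(registry_keys, "embedding", "all"))        # v1 then v2
print(select_modalities(registry_keys, "embedding", "version=1"))  # v1 only
```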

4.2 Planner fallback logic

For "all" with compatible_with declared:

1. Compute query encoding under EACH version's model.
2. Run per-version sub-queries in parallel.
3. For each source Item:
   - If present in highest-version result set: use that score.
   - Else: use the highest-version-where-present score, weighted by an
     "older-version penalty" coefficient (default 0.9 per version step).
4. Fuse and return top-K.

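The older-version penalty is a small but non-zero discount on scores taken from an older version, so that Items already re-encoded under the newest model rank slightly ahead of comparable Items available only in an older version. Default 0.9 per version step; operator-tunable.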
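Steps 3 and 4 of the fallback logic can be sketched as follows. The example assumes the per-version sub-queries have already run and returned non-negative, higher-is-better scores keyed by source Item, and it interprets "per version step" as the difference between version numbers. The function name and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def fuse_with_fallback(
    per_version_scores: Dict[int, Dict[str, float]],  # version -> {item id -> score}
    top_k: int,
    penalty: float = 0.9,
) -> List[Tuple[str, float]]:
    """Keep each Item's score from the highest version where it is present.

    Scores from older versions are multiplied by penalty ** (version steps
    behind the newest version). Assumes non-negative scores; scaling a
    negative score by 0.9 would raise it instead of discounting it.
    """
    newest = max(per_version_scores)
    fused: Dict[str, float] = {}
    for version in sorted(per_version_scores, reverse=True):  # newest first
        weight = penalty ** (newest - version)
        for item, score in per_version_scores[version].items():
            if item not in fused:  # a newer version already supplied this Item
                fused[item] = score * weight
    ranked = sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


scores = {
    2: {"a": 0.80, "b": 0.60},             # v2: partial coverage
    1: {"a": 0.95, "b": 0.50, "c": 0.70},  # v1: full coverage
}
top = fuse_with_fallback(scores, top_k=3)
print([item for item, _ in top])  # ['a', 'c', 'b']
```

Item "c" exists only in v1, so it enters the ranking with its v1 score discounted by one version step. Items "a" and "b" use their v2 scores and ignore v1 entirely.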

4.3 Score scale calibration

Different model versions produce different cosine-similarity distributions, so linear fusion of raw scores across versions risks miscalibration. Reciprocal rank fusion (RRF, spec/0015 §5.2) combines ranks rather than raw scores, which makes it scale-invariant; it is the recommended default for multi-version hybrid queries.
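For reference, a minimal sketch of reciprocal rank fusion over per-version result lists. It uses the commonly cited form, where each list contributes 1 / (k + rank) to an Item's fused score. The constant k = 60 is the conventional choice and an assumption here; the value and tie-breaking rule actually mandated by spec/0015 §5.2 may differ.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],  # one best-first list of item ids per version
    k: int = 60,
    top_k: int = 10,
) -> List[str]:
    """Fuse ranked lists using ranks only, so raw score scales never matter."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            fused[item] = fused.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by fused score descending; break ties by item id for determinism.
    return sorted(fused, key=lambda item: (-fused[item], item))[:top_k]


v2_ranking = ["a", "b", "c"]  # results from the v2 sub-query, best first
v1_ranking = ["b", "d", "a"]  # results from the v1 sub-query, best first
print(reciprocal_rank_fusion([v2_ranking, v1_ranking]))  # ['b', 'a', 'd', 'c']
```

Because only ranks enter the formula, a v1 model that produces cosine similarities clustered near 0.9 and a v2 model clustered near 0.3 contribute on equal footing.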

5. Garbage collection across versions

The protocol's GC (spec/0006 §7.3) is purely content-reachability — Objects unreachable from any live Ref are eligible for deletion. Multi-version registries simply keep the old version's Track + SpatialIndex reachable until the operator explicitly removes them from registry.

5.1 Decommissioning v1

When the operator decides v1 is no longer needed:

  1. Publish a Manifest whose registry omits v1 entirely.
  2. Old Manifests in the parent chain still reference v1's Track Object — they remain reachable until the parent chain is GC'd.
  3. Eventually, after the GC's safety threshold (default 24h, spec/0006 §7.3), v1's Objects become eligible for deletion.
  4. Optional: a "snapshot roll-up" (spec/0008 §9.3) accelerates GC by collapsing the parent chain.

5.2 Concurrent migration safety

Parallel migrations to different target modalities (e.g., two operators re-encoding to v2 and v3 concurrently) are supported: each Reencode publishes its own Layer Manifests, and the standard spec/0008 merge / rebase rules apply at Publish time.

6. Worked example: 10B CLIP-B/32 → CLIP-L/14 migration

Concrete scenario. DreamDB deployment with 10B image embeddings under embedding.f32.dim=512.bucketed.spatial-bits=22.version=1 (CLIP-B/32). Operator wants to migrate to CLIP-L/14 (dim=768).

Step 1: register v2 alongside v1.

embedding.f32.dim=768.bucketed.spatial-bits=22.version=2

Manifest registry now has both. coverage = "partial" for v2; items_done = 0.

Step 2: invoke Reencode.

Reencode(source = v1, target = v2, transform_ref = <CLIP-L/14 model ref>,
         batch_size = 1_000_000, budgets = {max_items_per_hour = 100_000_000})

Step 3: Reencode publishes a Layer Manifest every 1M Items (10,000 in total). After ~100 hours (10B Items ÷ 100M Items/hour, the budgeted rate), all 10B Items are re-encoded.

Step 4: operator updates v2's coverage to "complete". Queries via version_preference = "all" now route to v2 exclusively (with no v1 fallback needed).

Step 5 (optional): operator omits v1 from a subsequent Manifest's registry. v1's Objects become GC-eligible after the safety threshold.

Cost:

  • Inference: 10B forward passes × ~5 ms = ~14000 GPU-hours (operator-side, out of band).
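  • Storage: v2 adds ~10B × 3 KB (768 × f32, uncompressed) = ~30 TB. During transition v1 and v2 are both present, so the total is v1's footprint (~10B × 2 KB = ~20 TB if 512 × f32 is also stored uncompressed) plus ~30 TB. After GC: ~30 TB (v2 only; v1 freed).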
  • Network: ~30 TB outbound from compute layer to backend; standard.
  • Wall clock: bounded by inference, not DreamDB.
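The headline figures can be reproduced with a few lines of arithmetic. The inputs are the scenario's own numbers (10B Items, ~5 ms per forward pass, dim=768 f32 vectors, 1M-item batches, a budget of 100M Items per hour). The derived GPU count is an extra illustration that assumes perfect utilization and one Item per forward pass.

```python
ITEMS = 10_000_000_000
FORWARD_PASS_SECONDS = 0.005        # ~5 ms per Item, from the scenario
V2_DIM = 768
BYTES_PER_F32 = 4
BATCH_SIZE = 1_000_000
MAX_ITEMS_PER_HOUR = 100_000_000    # budgeted rate from Step 2

gpu_hours = ITEMS * FORWARD_PASS_SECONDS / 3600     # total inference time
wall_clock_hours = ITEMS / MAX_ITEMS_PER_HOUR       # time at the budgeted rate
gpus_needed = gpu_hours / wall_clock_hours          # parallelism to sustain that rate
v2_bytes = ITEMS * V2_DIM * BYTES_PER_F32           # uncompressed v2 vectors
layer_manifests = ITEMS // BATCH_SIZE               # one checkpoint per batch

print(f"GPU-hours:        {gpu_hours:,.0f}")        # about 13,889
print(f"Wall clock:       {wall_clock_hours:,.0f} h")
print(f"GPUs to keep up:  {gpus_needed:,.0f}")      # about 139
print(f"v2 vectors:       {v2_bytes / 1e12:.2f} TB")
print(f"Layer Manifests:  {layer_manifests:,}")
```

The ~14000 GPU-hours and ~100 hours of wall clock are consistent only if roughly 139 GPUs run in parallel, which is one way to see that the schedule is bounded by inference capacity rather than by DreamDB.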

The DreamDB side adds zero new failure modes — partial state is always a valid queryable Manifest; resume is idempotent; rollback is just "publish a Manifest that omits v2."

7. Conformance categories (per spec/0009 §8.6.3)

Category | Pass criterion | Coverage
evolve.multi-version-registry.* | Registry with v1 + v2 both query correctly | Both versions, independent queries
evolve.reencode.resumable.* | Crash mid-Reencode + resume produces same final state as uninterrupted run | Failure injected per-batch
evolve.reencode.checkpoint-monotonic.* | last_anchor strictly increases across batches | Adversarial batch orderings
evolve.planner.all-versions-fallback.* | version_preference: "all" fills v2 coverage gap with v1 scores | Partial coverage scenarios
evolve.gc.decommission-v1.* | After v1 removed from registry, its Objects become eligible after safety threshold | Standard GC test
evolve.compatible-with.semantics.* | coverage: "complete" means every v1 Item present in v2; verifier asserts | Mixed coverage states

8. Out of scope

  • Lossy embedding-space alignment. Mapping existing vectors from one model's space into another's with substantially different geometry (e.g., 1024-dim → 1536-dim) without re-encoding the source Items is a research problem; DreamDB stores the bytes, not the alignment.
  • Cross-model query. "Search v1 corpus using a v2 query" requires a learned alignment matrix; out of scope.
  • Schema diff / merge tools. Out of protocol; SDK / CLI concern.
  • Automatic transform validation. That transform_ref actually corresponds to model X is the operator's audit problem.

9. Open questions

  • OQ-71 (→ this spec): Should compatible_with declare a similarity-space transform (e.g., a learned alignment matrix hash) for cross-version score combination? Currently we use the older-version penalty heuristic; a learned alignment could be more principled. Defer to v0.X+1.
  • OQ-72 (→ this spec): Reencode budgets — are they advisory or enforced? Advisory by default; backend may add enforcement via spec/0018 multi-tenant quotas.
  • OQ-73 (→ spec/0006): Should Reencode be an 11th verb or fold into Append-with-source-pointer? Probably a distinct verb; it has multi-batch semantics that don't fit Append cleanly. Resolve in spec/0006 amendment.
  • OQ-74 (→ spec/0009): Conformance vectors for Reencode under failure (network blip, partial PUT). Block v0.X release.

Next: spec/0018 — multi-tenant operation. Now that 10B-scale + federation + hybrid + streaming + schema evolution all work for ONE tenant, can they work for many at once without collapsing into noisy-neighbor chaos?