Spec 0019 — Data-Plane Encryption (Sketch)
Status: Draft sketch (Phase 4 design; deep-detail deferred to v0.X+1 after first operator pilot).
Depends on: spec/0001, spec/0002, spec/0005, spec/0012, spec/0018.
Motivation: Regulated industries (healthcare, finance, defense, EU-GDPR scope) need encryption of DreamDB content at rest and in transit, with key management aligned to enterprise KMS. The protocol's content-addressing creates a non-obvious tension: traditional encryption changes ciphertext bytes per-encryption (via random IVs), which breaks content-hash equality and therefore breaks dedup, federation, and the cache identity that spec/0006 §3 depends on. This spec sketches the resolution — content-addressing operates on ciphertext, plaintext stays hidden — and pins the open questions a full v0.X+1 spec must answer.
This spec is intentionally a SKETCH. It lands the design framing, the load-bearing primitives, and the open questions; the bit-level CBOR/header formats are explicitly deferred until at least one production deployment exercises the model.
1. Purpose
The protocol's design (spec/0000 Principle 4) commits to "content-addressing as identity." Encryption that breaks this commitment breaks the protocol. The fix is to treat encryption as a transformation inside the content-address: the hash is computed on the encrypted bytes, not the plaintext.
This works because:
- A reader fetching <hash> from the backend gets the same ciphertext bytes regardless of who's reading (subject to decrypt key access).
- Two writers producing the same plaintext + same key + same encryption parameters produce the same ciphertext (provided the IV is deterministic: derived from plaintext + key, not random). This preserves dedup.
- Federation works unchanged: ciphertext is just bytes; backends serving it don't need to know the key.
The cost: convergent encryption (deterministic key and IV) is vulnerable to confirmation attacks, in which an attacker who already holds a candidate plaintext checks whether it is stored. A simple mitigation (a per-Space salt in the derivation) defends against most realistic threats, but still leaks "this Space contains a known document" to anyone with backend access who can also obtain the salt. Operators in high-security tiers must give up dedup entirely as the price of standard random-IV encryption.
By the end of this document the following are concrete:
- The two encryption modes DreamDB supports: convergent (dedup-preserving within a Space) and per-Space (non-deduplicating). The operator selects one per Space, declared in the Manifest.
- The EncryptionMeta Object: per-Object encryption parameters carried inline in a header at the start of the ciphertext bytes (§4).
- The KMS abstraction: how a DreamDB backend invokes external key services. AWS KMS, GCP KMS, HashiCorp Vault all fit; no DreamDB-specific KMS protocol.
- The envelope encryption pattern: each Object encrypted with a per-Object data-encryption-key (DEK), which is itself encrypted by a key-encryption-key (KEK) held by the KMS.
- The content-addressing-preserving discipline: hash is on ciphertext; clients verify on fetch; plaintext never appears in the address grammar.
- The search-over-encrypted reality check: this spec does NOT promise search over encrypted vectors. Encryption is for at-rest protection; the SDK decrypts before searching. Searchable encryption and related techniques (e.g., SSE, ORE, PSI) are out of scope.
What stays defined elsewhere:
- Capability tokens for read/write authorization — spec/0012, spec/0018. Encryption is orthogonal: a valid token authorizes you to retrieve ciphertext; key access authorizes you to decrypt.
- Quotas / multi-tenant isolation — spec/0018.
- Manifest registry shape — spec/0002 §7.2.
What this document does NOT define:
- Per-token-of-text encryption. Token-level encryption over inverted-index posting lists is research; out of v0.X.
- Searchable symmetric encryption (SSE). Out.
- Homomorphic compute over encrypted vectors. Out.
- TLS / HTTPS encryption in transit. spec/0005 already mandates HTTPS for federated transport.
- Key rotation strategies. KMS-native concern; this spec describes only how DreamDB consumes rotated keys.
2. The two encryption modes
2.1 Convergent (deduplicating)
Each Object's encryption key is derived deterministically from the plaintext + a per-Space salt:
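A minimal sketch of one such derivation, assuming HMAC-SHA-256 keyed by the Space salt over the SHA-256 of the plaintext. The function name, the domain-separation labels, and the choice of HMAC are illustrative assumptions, not a construction this spec pins.

```python
import hashlib
import hmac
from typing import Tuple


def derive_convergent_key(plaintext: bytes, space_salt: bytes) -> Tuple[bytes, bytes]:
    """Derive a (DEK, nonce) pair deterministically from plaintext and the Space salt."""
    # Hash the plaintext first so the HMAC input has a fixed size.
    plaintext_hash = hashlib.sha256(plaintext).digest()
    # Distinct labels (hypothetical) keep the key and the nonce independent.
    dek = hmac.new(space_salt, b"dreamdb/dek\x00" + plaintext_hash, hashlib.sha256).digest()
    nonce = hmac.new(space_salt, b"dreamdb/nonce\x00" + plaintext_hash, hashlib.sha256).digest()[:12]
    return dek, nonce  # 32-byte key, 12-byte nonce (sizes suit AES-256-GCM)
```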
Two writers producing the same plaintext (same Space, same salt) produce identical ciphertext. Cross-writer, cross-Item dedup is preserved. The content-hash address is on the ciphertext + a small encryption header (§4); two callers asking "fetch hash X" get the same bytes.
Tradeoff: an attacker with backend access who knows a plaintext, and who can obtain the Space salt, can verify its presence by deriving the expected ciphertext hash. They cannot decrypt unknown content, but presence is leaked.
This mode is appropriate for: most enterprise data, internal corporate documents, dataset libraries where presence-leakage of common files (e.g., "the Linux kernel source") is acceptable.
2.2 Per-Space (non-deduplicating)
Each Object's encryption key is random per-encryption, sealed under the KMS:
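As a rough sketch, assuming a 256-bit DEK, a 96-bit AEAD nonce, and a `seal` callable that stands in for the KMS wrapping call (all hypothetical names and sizes):

```python
import secrets
from typing import Callable, Tuple


def new_per_space_dek(seal: Callable[[bytes], bytes]) -> Tuple[bytes, bytes, bytes]:
    """Return (dek, nonce, sealed_dek) for a single write in per-Space mode."""
    dek = secrets.token_bytes(32)    # fresh random key for every write
    nonce = secrets.token_bytes(12)  # fresh random nonce for every write
    sealed_dek = seal(dek)           # the KMS wraps the DEK under the Space's KEK
    return dek, nonce, sealed_dek
```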
Two writes of the same plaintext produce different ciphertext. No cross-write dedup. Address is on the ciphertext, which is unique per write — every write produces a fresh hash, even of identical content.
Tradeoff: no dedup; storage cost scales linearly with write count. But presence-leakage is eliminated.
This mode is appropriate for: medical records, regulated financial data, anything where "two patients have the same diagnosis bytes" must not be inferable.
2.3 Mode selection
A Space's encryption mode is declared in the Manifest's Space-config sub-Object:
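One hypothetical shape for that sub-Object, written as a validated Python dataclass. Only the mode value "none" comes from this spec; the other mode strings and the `kek_id` and `aead` fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MODES = ("none", "convergent", "per-space")  # last two strings are assumed


@dataclass(frozen=True)
class SpaceEncryptionConfig:
    mode: str = "none"            # v0 default: no encryption
    kek_id: Optional[str] = None  # hypothetical: KMS key that protects DEKs or the salt
    aead: str = "AES-256-GCM"     # hypothetical: algorithm identifier

    def __post_init__(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown encryption mode: {self.mode!r}")
        # Both encrypted modes rely on the KMS, so a key identifier is required.
        if self.mode != "none" and not self.kek_id:
            raise ValueError("encrypted modes need a kek_id")
```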
mode: "none" is the v0 default — no encryption (single-tenant or low-sensitivity deployments). The other modes are opt-in per Space at Genesis time. Changing modes mid-Space requires a full Reencode (per spec/0017).
3. KMS abstraction
DreamDB does not implement key management. It calls out to an external KMS — AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault, on-prem HSM, etc.
3.1 The KMS contract DreamDB assumes
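A sketch of the contract as a Python `Protocol`, assuming the two operations are sealing and unsealing a DEK under a named KEK. The method names are hypothetical, and `InMemoryKms` is only a test double that hands out opaque handles; it is not a model of any real provider's API.

```python
import secrets
from typing import Dict, Protocol, Tuple


class KmsConnector(Protocol):
    def wrap_dek(self, kek_id: str, dek: bytes) -> bytes:
        """Seal a DEK under the named KEK; returns opaque sealed bytes."""
        ...

    def unwrap_dek(self, kek_id: str, sealed_dek: bytes) -> bytes:
        """Unseal a DEK previously sealed under the named KEK."""
        ...


class InMemoryKms:
    """Test double: keeps DEKs in a dict and returns random handles."""

    def __init__(self) -> None:
        self._store: Dict[bytes, Tuple[str, bytes]] = {}

    def wrap_dek(self, kek_id: str, dek: bytes) -> bytes:
        handle = secrets.token_bytes(16)
        self._store[handle] = (kek_id, dek)
        return handle

    def unwrap_dek(self, kek_id: str, sealed_dek: bytes) -> bytes:
        stored_kek, dek = self._store[sealed_dek]
        if stored_kek != kek_id:
            raise PermissionError("sealed DEK belongs to a different KEK")
        return dek
```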
That's it. Every major KMS provider implements these two primitives. The SDK calls them through an operator-configured connector (HTTPS to the KMS endpoint, typically with mutual-TLS or IAM-signed requests).
For convergent mode, the SDK ALSO needs the Space's salt. The salt itself is encrypted under the KMS at Space creation; the operator must grant decrypt access to the salt to every authorized reader/writer (separate from per-Object key access).
3.2 Caching DEKs
DEK retrieval is the high-latency step (~10-50 ms per KMS call). SDKs MUST cache:
- Per-Object DEKs after first retrieval (LRU; default 10K entries).
- Per-Space salts (long-lived; refresh on policy changes).
LRU eviction aside, cached entries are dropped at SDK session end. KEK rotation invalidates the entire DEK cache for the affected KEK; the SDK MUST handle this gracefully (re-fetch on the next decrypt attempt).
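The caching rules above can be sketched as a small LRU keyed by Object hash. The class and method names are hypothetical, and `fetch` stands in for the KMS round trip.

```python
from collections import OrderedDict
from typing import Callable, Tuple


class DekCache:
    """LRU cache of per-Object DEKs, each tagged with the KEK that sealed it."""

    def __init__(self, capacity: int = 10_000) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, Tuple[str, bytes]]" = OrderedDict()

    def get(self, object_hash: str, kek_id: str, fetch: Callable[[], bytes]) -> bytes:
        entry = self._entries.get(object_hash)
        if entry is not None and entry[0] == kek_id:
            self._entries.move_to_end(object_hash)  # mark as most recently used
            return entry[1]
        dek = fetch()  # slow path: one KMS call
        self._entries[object_hash] = (kek_id, dek)
        self._entries.move_to_end(object_hash)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return dek

    def invalidate_kek(self, kek_id: str) -> None:
        """Drop every DEK sealed under a rotated KEK; later reads re-fetch."""
        stale = [h for h, (k, _) in self._entries.items() if k == kek_id]
        for h in stale:
            del self._entries[h]
```

A session-scoped instance discarded at session end matches the eviction rule described above.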
4. The EncryptionMeta Object
Per-Object encryption metadata lives inline at the start of the Object's ciphertext bytes, NOT in a separate Manifest field. This keeps content-addressing simple: the hash covers the ciphertext + meta together.
4.1 Header layout
The header is ~50-300 bytes, depending on KEK ID + sealed DEK length. Negligible per-Object overhead.
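This spec defers the bit-level format (and points toward a CBOR shape), so the fixed-width layout below is purely illustrative. It only shows how a KEK ID, a sealed DEK, and a nonce could sit in front of the ciphertext; the magic bytes, field order, and field widths are all assumptions.

```python
import struct
from typing import NamedTuple, Tuple

# Hypothetical fixed part: magic, version, mode, kek_id_len, sealed_dek_len, nonce
_FIXED = struct.Struct(">4sBBHH12s")
MAGIC = b"DDBE"
MODE_CONVERGENT, MODE_PER_SPACE = 1, 2


class EncryptionMeta(NamedTuple):
    mode: int
    kek_id: bytes
    sealed_dek: bytes  # empty in convergent mode
    nonce: bytes       # 12 bytes


def pack_header(meta: EncryptionMeta) -> bytes:
    if len(meta.nonce) != 12:
        raise ValueError("nonce must be 12 bytes")
    fixed = _FIXED.pack(MAGIC, 1, meta.mode, len(meta.kek_id), len(meta.sealed_dek), meta.nonce)
    return fixed + meta.kek_id + meta.sealed_dek


def unpack_header(blob: bytes) -> Tuple[EncryptionMeta, bytes]:
    """Split a stored Object into (header fields, ciphertext)."""
    magic, version, mode, kek_len, dek_len, nonce = _FIXED.unpack_from(blob, 0)
    if magic != MAGIC or version != 1:
        raise ValueError("not a recognised EncryptionMeta header")
    start = _FIXED.size
    end = start + kek_len + dek_len
    if len(blob) < end:
        raise ValueError("truncated header")
    kek_id = blob[start:start + kek_len]
    sealed_dek = blob[start + kek_len:end]
    return EncryptionMeta(mode, kek_id, sealed_dek, nonce), blob[end:]
```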
4.2 Convergent mode header
For convergent mode, Sealed DEK length = 0 (DEK is derivable from plaintext + salt). Decrypt path:
The "plaintext-content guess" step is the awkward part of convergent mode: a reader needs to know the plaintext hash to derive the key. In practice this is solved by separately storing the (hash → plaintext_hash) mapping in an additional small Object (encrypted under per-Space mode) that the reader fetches first. Alternative: pre-compute the plaintext_hash at write time and include it in the Manifest registry alongside the Object reference.
(This is the kind of detail the full v0.X+1 spec needs to nail down. The sketch acknowledges the issue.)
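To make the bootstrap problem concrete, here is a sketch of a convergent decrypt path that assumes the reader has already obtained plaintext_hash by one of the routes above. The HMAC-based derivation and the `aead_open` callable (standing in for AES-256-GCM decryption) are assumptions, not pinned behavior.

```python
import hashlib
import hmac
from typing import Callable, Tuple

# (key, nonce, ciphertext) -> plaintext; stands in for a real AEAD open call
AeadOpen = Callable[[bytes, bytes, bytes], bytes]


def derive_from_plaintext_hash(plaintext_hash: bytes, space_salt: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical derivation: HMAC-SHA-256 keyed by the Space salt."""
    dek = hmac.new(space_salt, b"dreamdb/dek\x00" + plaintext_hash, hashlib.sha256).digest()
    nonce = hmac.new(space_salt, b"dreamdb/nonce\x00" + plaintext_hash, hashlib.sha256).digest()[:12]
    return dek, nonce


def decrypt_convergent(ciphertext: bytes, plaintext_hash: bytes,
                       space_salt: bytes, aead_open: AeadOpen) -> bytes:
    # 1. Re-derive the DEK and nonce; nothing is unsealed per Object.
    dek, nonce = derive_from_plaintext_hash(plaintext_hash, space_salt)
    # 2. Decrypt. A real AEAD would already reject a wrong key at this point.
    plaintext = aead_open(dek, nonce, ciphertext)
    # 3. Confirm the result matches the hash the reader was given.
    if not hmac.compare_digest(hashlib.sha256(plaintext).digest(), plaintext_hash):
        raise ValueError("plaintext does not match the supplied plaintext_hash")
    return plaintext
```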
4.3 Per-Space mode header
For per-Space mode, Sealed DEK length > 0. Decrypt path:
Simpler than convergent — the DEK is fully described in the header.
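A matching sketch of the per-Space path, assuming the header has already been parsed into its fields. `unwrap_dek` stands in for the KMS unseal call and `aead_open` for AES-256-GCM decryption; both names are hypothetical.

```python
from typing import Callable

UnwrapDek = Callable[[str, bytes], bytes]          # (kek_id, sealed_dek) -> dek
AeadOpen = Callable[[bytes, bytes, bytes], bytes]  # (key, nonce, ciphertext) -> plaintext


def decrypt_per_space(kek_id: str, sealed_dek: bytes, nonce: bytes, ciphertext: bytes,
                      unwrap_dek: UnwrapDek, aead_open: AeadOpen) -> bytes:
    if not sealed_dek:
        # Sealed DEK length = 0 signals convergent mode, which takes the other path.
        raise ValueError("per-Space Objects must carry a sealed DEK")
    dek = unwrap_dek(kek_id, sealed_dek)       # one KMS call (or a cache hit)
    return aead_open(dek, nonce, ciphertext)   # AEAD verifies the tag, then decrypts
```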
5. Content-addressing-preserving discipline
The protocol's content-addressing semantics are preserved by these disciplines:
- Hash is on ciphertext + EncryptionMeta header, not plaintext. The address grammar is unchanged.
- Backend stores opaque bytes. The backend has no key access; the content-hash equality property survives.
- Federation works unchanged. A federated mirror replicates ciphertext; readers with key access decrypt locally.
- Cache identity = content hash (spec/0006 §3.2). Two cache entries for the same hash are bit-identical (same ciphertext + same plaintext after decryption).
- Dedup is mode-specific. Convergent ⇒ dedup preserved; per-Space ⇒ dedup lost (acceptable for high-security).
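The first two disciplines can be sketched in a few lines. SHA-256 and hex addresses are assumptions here; the actual hash function and address grammar are defined by the addressing spec, not by this one.

```python
import hashlib
import hmac


def content_address(header: bytes, ciphertext: bytes) -> str:
    """Address covers the EncryptionMeta header plus ciphertext, never the plaintext."""
    return hashlib.sha256(header + ciphertext).hexdigest()


def verify_fetched(address: str, blob: bytes) -> bytes:
    """Client-side check on fetch; needs no key, so mirrors and caches can run it too."""
    actual = hashlib.sha256(blob).hexdigest()
    if not hmac.compare_digest(actual, address):
        raise ValueError("content hash mismatch")
    return blob
```

Because verification needs only the stored bytes, a backend or federated mirror can run the same check without any key access.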
6. Search-over-encrypted reality check
The protocol does NOT support searching encrypted vectors or encrypted text. The mental model is:
- At rest: encrypted. Operators with backend access cannot read plaintext without KMS access.
- In the SDK's memory: decrypted. The SDK has KMS access (via operator-issued capability + key access); it decrypts on fetch and operates on plaintext for ANN search, BM25 scoring, etc.
- In transit: TLS (spec/0005 §6.1.1).
Practical implication: the SDK is the trust boundary. An attacker compromising the SDK runtime sees plaintext. An attacker compromising the backend sees only ciphertext.
For deployments needing search-over-encrypted (e.g., a SaaS that wants to host encrypted user data without ever seeing plaintext): spec/0019 does not provide this. The SDK must be deployed in the user's environment, not the SaaS provider's. Multi-party-computation or homomorphic-search-over-vectors techniques are research-grade and out of v0.X.
7. Open questions (the load-bearing list)
Because this is a SKETCH, the open questions are unusually load-bearing — each blocks promotion to full draft:
- OQ-79 (→ this spec): Convergent-mode plaintext_hash bootstrap. How does a reader learn the plaintext_hash needed to derive the DEK? Options: (a) inline in EncryptionMeta header (defeats convergent's presence-leakage defense), (b) separate small mapping Object encrypted per-Space, (c) computed at write time and registered in the Manifest. Pick (b) or (c) after pilot.
- OQ-80 (→ this spec): Key rotation choreography. When the KEK rotates, sealed DEKs (per-Space mode) are still valid (decrypt against the old KEK version). When are they re-sealed? Probably background-rewrite per-Object on the operator's schedule. Spec needs to pin the contract.
- OQ-81 (→ spec/0012): Federation under encryption. Mirror backends store ciphertext; readers at the mirror need KMS access. Does the federation protocol describe trust delegation between KMS instances, or punt to "operators coordinate"? Punt is acceptable for v0.X; revisit in v0.X+1.
- OQ-82 (→ spec/0010, spec/0013): Encrypted SpatialIndex / VectorCompressor / GraphIndex. These Objects' content (codebooks, hyperplanes, graph adjacency) is sometimes sensitive because it leaks information about the training data. Encrypt them under the same per-Space scheme? Probably yes for per-Space mode; convergent mode doesn't apply since these aren't deduped across Spaces.
- OQ-83 (→ this spec): AEAD algorithm registry. Pin to AES-256-GCM for v0.X. ChaCha20-Poly1305 is a defensible alternative for low-end hardware without AES-NI. Add as opt-in?
- OQ-84 (→ spec/0006): GC under encryption. GC walks Manifests for reachable hashes — works on ciphertext addresses, no key access needed. Should be a no-op extension; verify in pilot.
- OQ-85 (→ spec/0018): Quota accounting for encrypted Objects. Encrypted Objects are slightly larger than plaintext (header + AEAD tag). Does storage quota count ciphertext bytes or plaintext bytes? Probably ciphertext (it's what the backend stores).
- OQ-86 (→ spec/0009): Conformance vectors for encrypt/decrypt round-trip across architectures. AES-256-GCM is well-tested but the EncryptionMeta CBOR shape needs round-trip vectors. Block v0.X+1 release.
- OQ-87 (→ this spec): Per-modality encryption opt-out. Operators may want title.text Constants unencrypted (for human inspection) while embedding Spatial Buckets are encrypted. Add per-modality encryption override in registry?
8. Conformance (placeholder)
The full conformance battery is deferred to the v0.X+1 spec promotion. Categories the v0.X+1 draft MUST include:
| Category | Pass criterion (preliminary) |
|---|---|
| enc.convergent.dedup.* | Same plaintext + same salt → identical ciphertext hash |
| enc.per-space.uniqueness.* | Same plaintext + per-Space mode → different ciphertext hash per write |
| enc.aead.tamper-detect.* | Modified ciphertext byte → AEAD tag verification fails |
| enc.kms.dek-cache.* | Per-Object DEK cached after first KMS call; verified by reduced KMS RPS |
| enc.cross-arch.aes-roundtrip.* | AES-256-GCM round-trip bit-identical on x86-64 / ARM64 / Apple Silicon |
| enc.metadata.federation-safe.* | EncryptionMeta header + ciphertext round-trip through federate verb intact |
9. Out of scope
- Per-user keys. DreamDB's tenant boundary is the Space; sub-tenant per-user encryption is application-layer.
- PIR (Private Information Retrieval). Hiding query patterns from the backend is research; out.
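- Quantum-resistant encryption. AES-256-GCM's 256-bit key is generally considered to keep an adequate margin against known quantum attacks (Grover's algorithm roughly halves the effective key strength, to about 128 bits); for KEK protection, operators whose KMS relies on asymmetric wrapping (KEM/RSA) must follow their KMS's quantum-resistance roadmap.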
- Encrypted Manifests. Manifests are public metadata by design — they enumerate addresses, registry entries, schema. Encrypting them would break the operator-side audit and federation discovery story. v0 leaves Manifests cleartext; v0.X+ MAY add an opt-in encryption pass with documented operational tradeoffs.
This was the final spec in the Phase-4 batch. The protocol now spans 20 numbered drafts covering: data model, addressing, time, spatial/scalar/graph indexing, federation, streaming, hybrid retrieval, multi-tenant, encryption. The next step is implementation pilot — start with the smallest-blast-radius spec (probably spec/0014 Path A chunking) and validate the spec-first workflow end-to-end.