DreamDB v0.2.0 bec026

Changelog

All notable changes to DreamDB are documented here. Entries track the status of the reference implementation; protocol-spec changes are noted with a reference to their spec document. Dates use ISO format (YYYY-MM-DD).

[Unreleased]

Added — AWS S3 / SigV4 production path (2026-05-22)

  • dreamdb-connector-http/src/sigv4.rs: S3Signer wraps the aws-sigv4 crate; it produces Authorization, x-amz-date, x-amz-content-sha256, and (optionally) x-amz-security-token headers per request.
  • HttpConnectorConfig.signer_from_env() auto-detects AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN / AWS_REGION (falls back to us-east-1). No signer attached → connector sends unsigned requests (MinIO-anonymous dev mode preserved).
  • Every connector verb (PUT / GET / HEAD / LIST / DELETE / multi-range GET) now signs when a signer is configured; SigV4 covers all wire headers including per-request If-Match, Range, etc.
  • dreamdb-cli/src/main.rs and dreamdb-dataset-python/src/lib.rs factories pick up the env-var signer automatically — same code talks to MinIO and S3.
  • Verified live against s3://dreamdb-test-20260518/ in us-east-1: create + append + count + snapshot + branch + history + delete all round-trip; 69 Objects landed, 19.8 KB total.
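
The env-var detection described above can be sketched in a few lines of Python. This is an illustration, not the connector's Rust code: the EnvCredentials type and credentials_from_env function are hypothetical names, and treating a missing key or secret as "no signer" is an assumption about how partial credentials are handled.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvCredentials:
    """Hypothetical container for the values a SigV4 signer needs."""
    access_key_id: str
    secret_access_key: str
    session_token: Optional[str]
    region: str


def credentials_from_env(env: Mapping[str, str] = os.environ) -> Optional[EnvCredentials]:
    """Return credentials if the AWS variables are set, else None.

    None stands for the unsigned (MinIO-anonymous) mode: no signer is attached.
    Assumption: both the key id and the secret must be present to sign.
    """
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        return None
    return EnvCredentials(
        access_key_id=access_key,
        secret_access_key=secret_key,
        # The session token is optional; it is only set for temporary credentials.
        session_token=env.get("AWS_SESSION_TOKEN") or None,
        # Fall back to us-east-1 when AWS_REGION is unset, as the changelog states.
        region=env.get("AWS_REGION") or "us-east-1",
    )
```

Passing an explicit mapping instead of os.environ keeps the function easy to test and makes the unsigned fallback visible: credentials_from_env({}) returns None.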

Added — Python SDK ergonomics (2026-05-22)

  • Dataset.iter_stream(batch_size, fields, channel_buffer) (B1.5): real Python generator backed by a tokio mpsc channel. Bounded RAM per batch (verified 104 MB peak RSS on 91K records). Closes the last 10B-scale follow-up.
  • Dataset.delete(anchors, reason=None) — tombstone anchors via spec/0020. Returns the new Manifest hash.
  • Dataset.tombstone_set() — resolved set of suppressed anchors at the current Manifest.
  • Dataset.merge(other_ref, strategy="fast-forward"|"union-tracks") — full Python parity with the Rust MergeStrategy enum.
  • Dataset.merge_many(branches) — N-way sequential union-merge for sharded ingest.
  • Dataset.history(max_depth=50) — walks the Manifest DAG via parents[0]; returns [{manifest, ts_ns, writer, parents_count, tracks_count}, …]. Mirrors the browser UI's ⏳ history button.
  • Dataset.list_refs() — lex-sorted list of every ref under the backend's refs/ prefix.
  • Dataset.count() + __len__ — record count at the current tip, via iter_stream. First-time-user reflex now works.
  • Schema is chainable: (vd.Schema().add_image(...).add_embedding(...).add_scalar_categorical(...)). Wrapped _RustSchema; backward compatible.
  • Friendlier HTTP errors: 403 / 404 / 412 / connection-refused / SI-conflict / schema-mismatch errors now carry actionable hints (e.g. "for local MinIO dev: mc anonymous set public local/<bucket>").
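
To illustrate why a generator backed by a bounded channel keeps memory flat, here is a minimal pure-Python analogue of the iter_stream design. It is a sketch under assumptions: the real implementation uses a tokio mpsc channel on the Rust side, whereas this version uses queue.Queue and a thread, and stream_batches is a hypothetical name rather than an SDK function.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def stream_batches(records: Iterable[T], batch_size: int, channel_buffer: int = 4) -> Iterator[List[T]]:
    """Yield batches produced by a background thread through a bounded queue.

    At most `channel_buffer` finished batches wait in the queue at any time,
    so the producer cannot run arbitrarily far ahead of the consumer.
    Assumes batch_size >= 1.
    """
    chan: "queue.Queue[object]" = queue.Queue(maxsize=channel_buffer)

    def producer() -> None:
        batch: List[T] = []
        try:
            for rec in records:
                batch.append(rec)
                if len(batch) == batch_size:
                    chan.put(batch)  # blocks while the queue is full (backpressure)
                    batch = []
            if batch:
                chan.put(batch)  # final partial batch
        finally:
            chan.put(_DONE)  # always signal completion so the consumer can stop

    # Daemon thread: if the consumer abandons the generator early, the blocked
    # producer does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = chan.get()
        if item is _DONE:
            return
        yield item  # type: ignore[misc]
```

For example, list(stream_batches(range(10), batch_size=4)) gives three batches, the last holding the two leftover records. Peak memory is then bounded by roughly channel_buffer + 2 batches (queued, in production, and in the consumer's hands) rather than by the dataset size.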

Added — CLI ergonomics (2026-05-22)

  • dreamdb query --backend ... --ref-name ... --field <name> --query-file <path> — top-K vector search from the command line. Operator spot-check verb; reads raw LE-f32 query bytes from a file.
  • dreamdb delete --ref-name ... <anchor> [<anchor>...] — tombstone CLI (paired with the SDK method).
  • dreamdb merge-many --ref-name <trunk> <branches>... — orchestrator CLI for sharded-ingest workflows.
  • Fixed: dreamdb snapshot --help no longer shows the GC description (doc-comment block at main.rs:207 was attached to the wrong variant); dreamdb gc --help now has its own help text.
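
Since dreamdb query reads raw little-endian f32 bytes, a query file can be prepared with the Python standard library alone. The sketch below assumes the file is a headerless run of 4-byte floats whose count matches the dimension of the queried embedding field; encode_query and decode_query are hypothetical helper names, not part of the CLI or SDK.

```python
import struct
from typing import List, Sequence


def encode_query(vector: Sequence[float]) -> bytes:
    """Pack a vector as consecutive little-endian 32-bit floats, no header."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode_query(raw: bytes) -> List[float]:
    """Inverse of encode_query; rejects byte strings that are not whole f32s."""
    if len(raw) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes (one f32 each)")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Usage: write the bytes to disk, then pass the path via --query-file.
#   from pathlib import Path
#   Path("query.f32").write_bytes(encode_query(my_vector))
```

A quick sanity check before running the CLI is that the file size equals 4 × the embedding dimension; a 512-dimensional query, for instance, is 2048 bytes.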

Added — operator manifests (2026-05-22)

  • dreamdb-cli/examples/sharded-ingest.yaml (228 lines) — 3-Job k8s pattern for N parallel workers + merge-many orchestrator.
  • dreamdb-cli/examples/ada-ivf-step-sharded.yaml (246 lines) — 4-stage pipeline (centroids → publish-SI → redispatch → finalize) for B3's sharded redispatch at 10B-scale.
  • dreamdb-cli/examples/ada-ivf-step.yaml — header updated with "when to use which YAML" pointer.

Added — documentation (2026-05-22)

  • README.md rewritten (40 → 165 lines): pitch + 4 first principles + Python sketch + status table + 60-second quickstart + repo layout + spec roadmap with honest gaps. Now reflects the actual shipped state.
  • docs/tutorial.md (370 lines): "DreamDB in 10 minutes" end-to-end walkthrough — schema + ingest + snapshot + query + PyTorch DataLoader + time-travel + ada-ivf-status + delete + sharded ingest. Step 14 documents the S3 migration path.
  • INDEX.md: refreshed to 21 specs (was 20); added spec/0020 row; "What's Next" updated to reflect shipped state.

Spec changes (2026-05-18 → 2026-05-22)

  • spec/0001 §2.1 — Object-kinds table expanded from 10 to 17 entries (added TombstoneList, ScalarIndex, VectorCompressor, ItemManifest, GraphIndex, GraphPage, plus existing ones that were missing).
  • spec/0008 §5.3 (new) — formalizes the distinction between layered-merge (the original §6.1 approach: both parents' Tracks coexist as separate TrackEntrys) and fused-merge (B2's implementation: one merged TrackObject per modality with cell-by-cell bucket reconciliation). Reference implementation ships fused-merge for SpatialBucket tracks; both are spec-valid.
  • spec/0020 §3.1 — broken cross-reference "spec/0001 §3.2" corrected to "spec/0002 §7.5" (path table actually lives in spec/0002).
  • spec/0020 §3.1 — TombstoneEntry anchor field clarified to be u64 (the Item-level TimeAnchor), not Multihash. Single anchor suppresses every record across every modality.
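
The effect of keying tombstones on the Item-level u64 anchor, rather than on a per-Object Multihash, can be shown with a small sketch. The data layout here (a dict of modality name to a list of (anchor, payload) pairs) and the apply_tombstones function are illustrative assumptions, not the representation defined in spec/0020.

```python
from typing import Dict, List, Set, Tuple

Record = Tuple[int, str]  # (anchor as u64, payload placeholder)

U64_MAX = 2**64 - 1


def apply_tombstones(tracks: Dict[str, List[Record]], tombstones: Set[int]) -> Dict[str, List[Record]]:
    """Drop every record whose anchor is tombstoned, in every modality.

    Because the anchor identifies the Item rather than a single Object,
    one tombstone entry suppresses the image, the embedding, and any other
    modality recorded under that anchor.
    """
    for anchor in tombstones:
        if not 0 <= anchor <= U64_MAX:
            raise ValueError(f"anchor {anchor} does not fit in a u64")
    return {
        modality: [rec for rec in records if rec[0] not in tombstones]
        for modality, records in tracks.items()
    }
```

With tracks for "image" and "embedding" that both contain anchors 1 and 2, tombstoning {1} leaves only the anchor-2 record in each modality. A Multihash-keyed tombstone would instead need one entry per modality Object.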

Changed — dreamdb-dataset/src/dataset.rs split into 10 modules

The monolithic 5592-line file was split into a slim facade (dataset.rs, 184 lines) plus 9 single-responsibility submodules:

Module                  LOC    Responsibility
dataset.rs              184    facade: structs (Dataset, Batch, MergeStrategy, DatasetVersion, FieldTrack), accessors, mod declarations
dataset/append.rs       1035   append, append_many (write path)
dataset/create.rs       592    create, open* (lifecycle)
dataset/fetch.rs        337    private fetch helpers for Manifest/Track/Bucket Objects
dataset/iter.rs         1155   iter, iter_stream, iter_with_fields, iter_time_range
dataset/layer.rs        570    add_scalar_layer, add_embedding_layer
dataset/merge.rs        690    merge, merge_many, union-merge family
dataset/snapshot.rs     254    snapshot, branch, history, list_refs
dataset/tombstone.rs    235    delete, tombstone_set
dataset/util.rs         906    free helpers, SpatialDispatcher, unit tests

Test suite unaffected: 721 tests still pass after the split.

Added — 10B-scale blocker push (2026-05-15 → 2026-05-18)

All 7 execution blockers from design/0006-10b-scale-blockers.md shipped; the B7 audit confirmed that it is a no-op:

#    Blocker                 Status      Key change
B1   Streaming iter          shipped     Dataset::iter_stream lazy bucket walk, bounded RAM
B5   Parallel blob fetch     shipped     buffer_unordered(64) per-anchor prefetch
B3   Sharded redispatch      shipped     4-stage k8s pipeline (--orchestrate-phase, --redispatch-shard)
B4   Prefix-sharded GC       shipped     dreamdb gc --shard N --of M: partition by leading-u64 of multihash
B8   Tombstones              shipped     spec/0020 + Dataset::delete + dreamdb delete
B6   Rayon hash_vector       shipped     Parallel IvfCosine::compute_dots above 512K-flop threshold
B7   Manifest size at 10B    ✅ audit    ~2.5 KB at 10B; 400× headroom under 1 MiB
B2   Sharded ingest          shipped     MergeStrategy::UnionTracks + Dataset::merge_many + dreamdb merge-many
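
As a rough illustration of B4's "partition by leading-u64 of multihash", the sketch below assigns each Object to exactly one of M GC shards. The details are assumptions: the changelog does not say whether the leading 8 bytes are read big-endian or how they map to a shard, so this version reads them big-endian and takes the value modulo M. The names shard_of and owned_by are hypothetical.

```python
import hashlib


def shard_of(multihash: bytes, shard_count: int) -> int:
    """Map a hash to a shard index in [0, shard_count).

    Assumption: the first 8 bytes are read as a big-endian u64 and reduced
    modulo the shard count; the real CLI may partition differently.
    """
    if len(multihash) < 8:
        raise ValueError("need at least 8 bytes to read a leading u64")
    if shard_count < 1:
        raise ValueError("shard_count must be >= 1")
    leading = int.from_bytes(multihash[:8], "big")
    return leading % shard_count


def owned_by(multihash: bytes, shard: int, shard_count: int) -> bool:
    """True if the worker running `--shard shard --of shard_count` owns this Object."""
    return shard_of(multihash, shard_count) == shard


# Example: partition a few synthetic hashes across 4 workers.
hashes = [hashlib.sha256(bytes([i])).digest() for i in range(8)]
assignment = {h.hex()[:8]: shard_of(h, 4) for h in hashes}
```

The property that matters for GC is that the partition is total and disjoint: every Object is owned by exactly one shard, so M workers can sweep in parallel without coordinating and without missing or double-processing any Object.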

Latent bug discovered en route: after a FastForward merge, field_tracks was not refreshed, so reads returned stale pre-merge results. Fixed via Dataset::refresh_field_tracks_from_current().

Test suite: 705 → 721 tests, 0 regressions.

Other quality work

  • Spec audit: 6 stale-claim fixes across design/0001-0007 (Phase-0 banners refreshed, "Still no tombstones" → "✅ shipped", phantom spec/0021 references removed, etc.).
  • Unused-import warnings cleanup: cargo fix + manual sweep removed ~30 dead imports across dreamdb-dataset, dreamdb-protocol, dreamdb-bench.

Live evidence

  • 231K imagenet-100 ingested with CLIP embeddings, RaBitQ corrected, IVF k=70. Validated linear-probe training (val_acc 4.9% → 36.0% in 5 epochs).
  • B1.5 streaming verified mid-ingest on a partially written 91K-record bucket: 5400 records/sec, 104 MB peak RSS, 425 ms first-batch latency.
  • AWS S3 SigV4 verified on s3://dreamdb-test-20260518/ (us-east-1): full Dataset lifecycle round-trips (create/append/count/snapshot/branch/history/delete) in 32.6 s for 50 samples (WAN-bound, not DreamDB-bound).
  • 1.33M imagenet-1k-256 ingest aborted at ~162K records due to local disk pressure; the experiment validated SDK + B2/B3 mechanics; the partial bucket was reclaimed during disk cleanup.

Pre-history

This changelog starts with the 10B-scale push. Earlier work (Phase 1-3.4: dataset platform, schema persistence, IVF+RaBitQ foundation, Ada-IVF maintenance, paged tracks, chain-aware lineage, browser SDK) is documented in design/0002-known-flaws-retrospective.md and the per-spec status fields. The protocol spec (spec/0000-0020) has been stable since 2026-05-13.