Skip to content

The storage layer

Each collection's documents live in exactly one file: an append-only record log. This page explains the format, the crash-safety reasoning, and why "append-only plus rebuildable indexes" beats cleverer designs for a database you intend to fully understand.

Record framing

<collection>.log is a flat sequence of records:

OffsetSizeField
01type: 01 = PUT, 02 = DEL
14payload length, u32 little-endian
5npayload

A PUT payload is a complete BSON document (which must carry an ObjectId _id; one is generated if absent). A DEL payload is the 12 _id bytes. That's the entire format.

Updates do not modify bytes in place — ever. An update appends a new PUT for the same _id; a delete appends a DEL. The old record becomes garbage that compaction reclaims later. Note the BSON payload carries its own length and strict internal validation, so a corrupted record body is detected at decode time even though the log has no separate per-record checksum.

Replay: the recovery rule

Replaying the log front to back, letting the last record per _id win, reconstructs the exact live document set.

This is the contract everything else leans on. The _id index maps ids to PUT offsets, and secondary indexes map field values to ids — but both are derived. The startup sequence makes that concrete:

"Clean" is a one-byte flag in each index file's header: cleared (and flushed) before the first write of a session, set again only after a full successful flush. A crash at any point leaves it cleared, and the next open pays an O(n) rebuild instead of trusting possibly-torn pages. Indexes are caches; the log is the database.

Torn writes

What if the crash happens during an append? The replay loop reads a record header, checks that the full payload fits within the file, and stops silently at the first incomplete record. A torn tail is defined as not-yet-written data, not corruption:

The key observation: the torn record was never acknowledged to the client (responses are sent after the append returns), so dropping it preserves exactly the guarantee that was promised. There is no half-applied state because nothing besides the log needs to be consistent.

Why this design

The classic alternatives and what they'd cost here:

  • Update-in-place (heap files) needs free-space management, record relocation when documents grow, and write-ahead logging to make any of it crash-safe — a WAL plus the files it protects. The append-only log is its own WAL.
  • A full WAL + page-store (the SQLite/Postgres shape) is the production answer, but it doubles the moving parts. With one append-only file, fsync discipline reduces to "flush the log"; recovery reduces to "find the tear."
  • The trade: reads need an index lookup plus a seek (no clustering), deleted space accumulates until compaction, and startup after an unclean shutdown is O(data). For a single-node educational engine, all three are the right price.

Compaction

compact rewrites the log keeping only live PUT records (in _id-index order), fsyncs the new file, then atomically renames it over the old one. Since every stored offset just changed, all indexes are rebuilt afterward — simple beats subtle here; remapping offsets in-place was rejected as complexity with no observable benefit at this scale. A crash mid-compaction is safe: the rename is the commit point, and until it happens the original log is untouched.

Tombstones and space

A deleted document costs 17 bytes of DEL record forever-until-compaction, and its old PUT remains as dead weight. dbStats reports per-collection file sizes so the moment to compact is visible rather than mysterious.

BisonDB and Prairie are GPLv3 · educational projects.