Skip to content

The B+Tree

This is the page the whole project was built to earn. BisonDB's indexes are a hand-written, on-disk B+Tree — no std::map standing in for the hard part, no storage library. It maps encoded key bytes → small values (a log offset for the _id index; nothing at all for secondary indexes, whose keys are self-contained) and is verified by 100,000-operation randomized fuzzing against std::map as an oracle.

Why a B+Tree at all

Hash indexB+TreeLSM tree
Point lookupO(1)O(log n)O(log n), multi-level
Range scanimpossiblesequential leaf walkmerge across levels
Ordered iterationnoyesyes
Write patternrandomrandom page writesappend-only, compaction
Implementation sizesmallmoderatelarge (levels, bloom, merge)
Used byRedis hashes, Python dictSQLite, Postgres, MySQLRocksDB, Cassandra

Queries like {pop: {$gte: 40000}} are range queries — a hash index cannot answer them at all (BisonDB's Phase-2 startup hash could only find exact _ids). An LSM tree wins on write-heavy workloads but costs far more machinery. The B+Tree is the smallest structure that makes range queries cheap, which is exactly what explain's 29470 → 1015 docsExamined collapse demonstrates.

Pages and the pager

An index file is an array of fixed 4096-byte pages. Page 0 is the header (magic BSNI, root page id, free-list head, page count, and the clean flag from the storage page). Freed pages form a linked free list and are reused before the file grows. A small LRU cache (256 pages) with write-back keeps hot pages in memory; sync flushes dirty pages, fsyncs, then sets the clean flag.

Node layout: slotted pages

Both node kinds share one layout — a 12-byte header, a growing slot array, and cells growing backward from the page end:

offset 0                                                        4096
┌────────────┬───────────────────────┬─────────┬────────────────┐
│ header 12B │ slot array (u16 × n)  │  free   │ cells ←        │
│            │ sorted by key, grows →│  space  │ grow backward  │
└────────────┴───────────────────────┴─────────┴────────────────┘
Header offsetField
0node type: 1 = leaf, 2 = internal
2–3cell count (u16)
4–5free-space offset — the lowest cell start
8–11leaf: right-sibling page id · internal: rightmost child id

A leaf cell is keyLen u16 │ key │ valLen u16 │ value. An internal cell is keyLen u16 │ key │ childPageId u32, where the child holds keys less than the cell's key, and the header's rightmost child catches everything ≥ the last separator.

Why slots instead of packed records? Binary search needs random access to the i-th key, but keys are variable-length. The slot array gives O(log n) search over offsets that are cheap to shift on insert; deleting a cell just removes its slot and leaves a hole, and a page compacts its cell area only when a later insert actually needs the space.

Leaves are also chained left-to-right through the sibling pointer — that single u32 per page is what turns "find the start of the range" into "now just keep reading."

Key encoding

The tree compares keys with plain memcmp. All type-aware ordering is pushed into the encoding, which must therefore be order-preserving: encoded bytes compare exactly like the values they represent. Each key starts with a type-class tag (giving the cross-type order Null < Numbers < String < ObjectId < Bool < DateTime), then a payload:

TypeEncoding
NumbersInt32/Int64/Double normalized to double, bits transformed (below), big-endian
StringUTF-8 with 00 escaped as 00 FF, terminated 00 00
ObjectId12 raw bytes
Boolone byte
DateTimei64 + 2⁶³ bias, big-endian

Keys cap at 512 encoded bytes; longer values (and NaN, and documents/arrays) simply aren't indexed — those documents are still found by full scans, and index builds count them as skipped.

The double sign-flip trick

IEEE 754 doubles almost sort like their bit patterns — for positive numbers. Negative numbers have the sign bit set (so they'd sort above positives unsigned) and grow downward in magnitude as their bit patterns grow. Two transformations fix both:

  • sign bit clear (positive): set the sign bit → moves positives above all negatives
  • sign bit set (negative): flip all 64 bits → reverses the negatives' order and clears their sign bit
ValueIEEE bitsAfter transformUnsigned order
−2.0C000…00003FFF…FFFFsmallest
−1.0BFF0…0000400F…FFFF
0.00000…00008000…0000
1.03FF0…0000BFF0…0000
2.04000…0000C000…0000largest

(−0.0 is normalized to +0.0 first so the two zeros collide, matching numeric equality.) Stored big-endian, memcmp now sorts doubles numerically. The string escaping serves the same master: 00→00 FF with a 00 00 terminator keeps "a" < "a\0" < "aa" correct under bytewise comparison even with embedded NULs. A 30,000-case fuzz test asserts memcmp(encode(a), encode(b)) agrees with a direct value comparator on random pairs.

Composite keys for secondary indexes

A secondary index on pop stores encode(popValue) ‖ 00 ‖ _id — twelve extra bytes make every key unique (duplicate field values differ in id), keep entries grouped by value, and let a range scan recover the _id from the key itself without storing any cell value.

Insertion and splits

Descent is standard: binary-search each internal node's separators, follow the child, insert into the leaf. When a page can't fit the new cell even after compacting its holes, it splits:

Leaf splits copy the separator up (the key stays in the leaf — range scans need every key at leaf level); internal splits move their median up. A root split allocates a new root with one separator, which is the only way this tree gets taller — so it grows perfectly balanced, by construction.

Deletion: deliberately lazy

Textbook B-trees rebalance on underflow — borrow from a sibling, else merge, possibly cascading. BisonDB does something simpler: remove the cell, and only when a page hits zero cells unlink it (stitch the sibling chain via its in-tree predecessor, drop the parent's reference, collapse single-child roots).

The rationale is a cost/benefit honestly weighed: rebalancing buys back space earlier but adds the most error-prone code in the textbook algorithm. Lazy deletion keeps pages valid at all times (search and scans never care that a page is half-empty), reclaims fully-empty pages through the free list, and leaves the worst case — a long-lived half-empty tree after mass deletion — to the same compact operation the log already needs, which rebuilds indexes densely from scratch. The 100k-op fuzz hammers exactly this path with mixed inserts and erases.

Recovery, again

Nothing on this page changes the storage layer's rule: the tree flushes dirty pages on sync and marks itself clean; any crash leaves the flag unset, and the next open throws the file away and rebuilds from the log. The B+Tree never has to be repairable — only detectably broken — which removes an entire class of subtle corruption bugs from the design space.

Concurrency

One writer or many readers, enforced by a tree-level reader-writer lock; cursors hold the read lock for their lifetime. Page-level latching (what production engines do) was explicitly scoped out — see concurrency.

BisonDB and Prairie are GPLv3 · educational projects.