The BSON engine

BSON is MongoDB's binary document format: length-prefixed, typed, and traversable without parsing the whole document. BisonDB implements its own codec — encoder, decoder, and the Extended JSON text mappings — and validates it against MongoDB's official conformance corpus.

The value model

In memory, every BSON value is a Value: a tagged union (std::variant) over the 11 supported BSON types. Documents preserve insertion order. BSON is an ordered map, so the obvious std::map<std::string, Value> would be wrong because it sorts its keys; internally a document is a vector of key/value pairs with linear key lookup.

Type        Tag byte  Payload
Double      0x01      8-byte IEEE 754, little-endian
String      0x02      i32 length including NUL, then UTF-8 bytes, then 0x00
Document    0x03      a nested document (same format, recursively)
Array       0x04      a document whose keys are "0", "1", ...
ObjectId    0x07      12 raw bytes
Bool        0x08      1 byte, must be 0 or 1
DateTime    0x09      i64 ms since epoch
Null        0x0A      nothing
Int32       0x10      i32
Int64       0x12      i64
Decimal128  0x13      16 raw bytes (IEEE 754-2008 BID)
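
As a rough illustration of the value model described above, here is a minimal C++17 sketch. All names (Null, ObjectId, Document::append, Document::find, and so on) are hypothetical and chosen for this example; BisonDB's actual declarations may differ. It shows only the two properties the text mentions: an 11-alternative std::variant, and a document stored as a vector of key/value pairs with linear lookup.

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Small wrapper types so that each BSON type is a distinct variant alternative.
struct Null {};
struct ObjectId   { std::array<std::uint8_t, 12> bytes{}; };
struct DateTime   { std::int64_t ms_since_epoch = 0; };
struct Decimal128 { std::array<std::uint8_t, 16> bytes{}; };

struct Value;  // completed below; Document and Array refer to it recursively

// Ordered document: insertion order is kept because fields live in a vector.
struct Document {
    std::vector<std::pair<std::string, Value>> fields;
    void append(std::string key, Value v);
    const Value* find(const std::string& key) const;  // linear scan, first match
};

struct Array { std::vector<Value> items; };

// One alternative per row of the type table.
struct Value {
    std::variant<double, std::string, Document, Array, ObjectId, bool,
                 DateTime, Null, std::int32_t, std::int64_t, Decimal128> data;
};

inline void Document::append(std::string key, Value v) {
    fields.emplace_back(std::move(key), std::move(v));
}

inline const Value* Document::find(const std::string& key) const {
    for (const auto& field : fields) {
        if (field.first == key) return &field.second;
    }
    return nullptr;  // key not present
}
```

Appending "name" and then "age" leaves them in that order in fields, which a std::map would not guarantee. Constructing values with explicit types, such as Value{std::string("ada")} or Value{std::int32_t{36}}, avoids surprises from implicit conversions when the variant holds bool, integer, and double alternatives side by side.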

Decoding a real document, byte by byte

The document {"name": "ada", "age": 36} is 28 bytes on the wire:

Offset  Bytes (hex)     Meaning
0–3     1C 00 00 00     total document size = 28, little-endian
4       02              element type: String
5–9     6E 61 6D 65 00  key "name", NUL-terminated cstring
10–13   04 00 00 00     string length 4 (3 chars + trailing NUL)
14–17   61 64 61 00     "ada" + NUL
18      10              element type: Int32
19–22   61 67 65 00     key "age", NUL-terminated cstring
23–26   24 00 00 00     36, little-endian
27      00              document terminator

(Offsets are zero-indexed, so the terminator at offset 27 is the 28th byte; the declared size counts it.) Every structural fact in that table is something a hostile input can lie about, which shapes the decoder described in the next section.
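
Before turning to the decoder, the byte layout in the table can be reproduced with a few lines of C++. This is a hand-rolled sketch for this one document only; the helper names (put_i32, put_cstring, encode_example) are made up for the example and are not BisonDB's encoder API. Counting the bytes it produces gives 4 + (1 + 5 + 4 + 4) + (1 + 4 + 4) + 1 = 28, with the terminator at offset 27.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a 32-bit integer in little-endian byte order.
inline void put_i32(Bytes& out, std::int32_t v) {
    const auto u = static_cast<std::uint32_t>(v);
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>((u >> shift) & 0xFF));
    }
}

// Append the bytes of s followed by a NUL (a BSON cstring).
inline void put_cstring(Bytes& out, const std::string& s) {
    out.insert(out.end(), s.begin(), s.end());
    out.push_back(0x00);
}

// Encode {"name": "ada", "age": 36} by hand.
inline Bytes encode_example() {
    Bytes out;
    put_i32(out, 0);                 // placeholder for the total size

    out.push_back(0x02);             // element type: String
    put_cstring(out, "name");
    const std::string ada = "ada";
    put_i32(out, static_cast<std::int32_t>(ada.size() + 1));  // length counts the NUL
    put_cstring(out, ada);

    out.push_back(0x10);             // element type: Int32
    put_cstring(out, "age");
    put_i32(out, 36);

    out.push_back(0x00);             // document terminator

    // Patch the declared size now that the full length is known.
    const auto total = static_cast<std::uint32_t>(out.size());
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<std::uint8_t>((total >> (8 * i)) & 0xFF);
    }
    return out;
}
```

Writing a placeholder and patching it afterwards is one common way to handle a length prefix that is only known once the body has been emitted.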

A decoder that assumes hostility

The decoder treats input as adversarial — the same bytes arrive from disk and from the network. Every read is bounds-checked against both the buffer and the document's declared size, and element parsing must land exactly on the terminator:

  • declared size < 5 or beyond the available bytes → reject
  • string length < 1, or extending past the document → reject
  • a sub-document whose declared size eats its parent's terminator → reject
  • bool bytes other than 0/1, unknown type tags → reject
  • invalid UTF-8 in keys or strings → reject
  • nesting beyond 200 levels → reject (this is a stack-overflow guard, not pedantry)

Every failure throws with the byte offset of the problem. There is no lookahead and no "resync" — a framing violation means the stream cannot be trusted.
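
The checks listed above can be sketched as a structural validator. This is a simplified illustration, not BisonDB's decoder: DecodeError, need, read_i32, and validate_document are hypothetical names, UTF-8 validation is left out to keep the example short, and treating the top-level document as depth 0 is an assumption. It walks a buffer, bounds-checks every read against the enclosing document, and throws with the byte offset of the first violation.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical error type: carries the byte offset of the violation.
struct DecodeError : std::runtime_error {
    std::size_t offset;
    DecodeError(const std::string& what, std::size_t off)
        : std::runtime_error(what + " at byte " + std::to_string(off)), offset(off) {}
};

constexpr int kMaxDepth = 200;  // nesting cap, as in the list above

// Throws unless n bytes fit in [pos, limit). Callers keep pos <= limit.
inline void need(std::size_t pos, std::size_t n, std::size_t limit, const char* what) {
    if (n > limit - pos) throw DecodeError(what, pos);
}

// Little-endian i32; the caller has already bounds-checked the 4 bytes.
inline std::int32_t read_i32(const Bytes& b, std::size_t pos) {
    const std::uint32_t u = std::uint32_t(b[pos]) | std::uint32_t(b[pos + 1]) << 8 |
                            std::uint32_t(b[pos + 2]) << 16 | std::uint32_t(b[pos + 3]) << 24;
    return static_cast<std::int32_t>(u);
}

// Validates the document at `start`, which may not extend past `limit`
// (limit <= b.size()). Returns the offset one past its terminator.
inline std::size_t validate_document(const Bytes& b, std::size_t start,
                                     std::size_t limit, int depth = 0) {
    if (depth > kMaxDepth) throw DecodeError("nesting too deep", start);
    need(start, 4, limit, "truncated document size");
    const std::int32_t size = read_i32(b, start);
    if (size < 5) throw DecodeError("declared size < 5", start);
    need(start, static_cast<std::size_t>(size), limit, "declared size exceeds available bytes");
    const std::size_t term = start + static_cast<std::size_t>(size) - 1;  // terminator offset
    std::size_t pos = start + 4;
    while (pos < term) {
        const std::size_t tag_pos = pos;
        const std::uint8_t tag = b[pos++];
        while (pos < term && b[pos] != 0x00) ++pos;  // scan the key cstring
        if (pos == term) throw DecodeError("unterminated key", tag_pos);
        ++pos;  // skip the key's NUL
        std::size_t fixed = 0;
        switch (tag) {
            case 0x01: case 0x09: case 0x12: fixed = 8; break;  // Double, DateTime, Int64
            case 0x07: fixed = 12; break;                       // ObjectId
            case 0x08: fixed = 1; break;                        // Bool
            case 0x0A: fixed = 0; break;                        // Null
            case 0x10: fixed = 4; break;                        // Int32
            case 0x13: fixed = 16; break;                       // Decimal128
            case 0x02: {                                        // String
                need(pos, 4, term, "truncated string length");
                const std::int32_t n = read_i32(b, pos);
                if (n < 1) throw DecodeError("string length < 1", pos);
                need(pos + 4, static_cast<std::size_t>(n), term, "string extends past document");
                pos += 4 + static_cast<std::size_t>(n);
                if (b[pos - 1] != 0x00) throw DecodeError("string missing NUL", pos - 1);
                continue;
            }
            case 0x03: case 0x04:                               // Document, Array
                // The child may use bytes up to, but not including, our terminator.
                pos = validate_document(b, pos, term, depth + 1);
                continue;
            default:
                throw DecodeError("unknown type tag", tag_pos);
        }
        need(pos, fixed, term, "value extends past document");
        if (tag == 0x08 && b[pos] > 1) throw DecodeError("bool byte not 0 or 1", pos);
        pos += fixed;
    }
    if (b[term] != 0x00) throw DecodeError("missing document terminator", term);
    return term + 1;
}
```

Passing the parent's terminator offset as the child's limit is what rejects a sub-document whose declared size would swallow that terminator. Calling validate_document(buf, 0, buf.size()) on the example document from the previous section should return its total length, and flipping a length or tag byte should raise DecodeError.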

Corpus testing

The decoder/encoder pair runs against the official MongoDB BSON corpus for all 11 types: every canonical document must decode and re-encode to identical bytes, every degenerate encoding must normalize to canonical, and every listed decode error must throw. On top of that, the fixture suite round-trips real mongodump files — 29,470 ZIP-code documents re-encode byte-for-byte.

Extended JSON

Text I/O uses MongoDB Extended JSON v2 in both modes:

  • Relaxed — human-friendly: plain numbers, {"$oid": "..."}, ISO-8601 $date.
  • Canonical — lossless: {"$numberInt": "36"} wrappers preserve exact types, so bson → JSON → bson reproduces the original bytes. The converter's round-trip tests depend on this.
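
To make the difference between the two modes concrete, here is a small sketch covering only the two integer types. The function name to_extended_json and the Number alias are invented for this example; the wrapper keys $numberInt and $numberLong are the ones Extended JSON v2 defines for Int32 and Int64.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>

// Only the two integer types, to keep the example small.
using Number = std::variant<std::int32_t, std::int64_t>;

// Relaxed mode prints a plain number, which loses the Int32/Int64 distinction.
// Canonical mode wraps the digits so the exact type survives a round trip.
inline std::string to_extended_json(const Number& n, bool canonical) {
    return std::visit([canonical](auto v) -> std::string {
        const std::string digits = std::to_string(v);
        if (!canonical) return digits;
        const char* wrapper =
            std::is_same_v<decltype(v), std::int32_t> ? "$numberInt" : "$numberLong";
        return std::string("{\"") + wrapper + "\": \"" + digits + "\"}";
    }, n);
}
```

An Int32 and an Int64 holding the same number print identically in relaxed mode but differently in canonical mode, which is why only canonical output can reproduce the original bytes.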

The JSON parser is a from-scratch RFC 8259 recursive-descent parser (with surrogate-pair handling and the same 200-level depth cap), extended with the $-wrapper folding — and, for the shell, a relaxed mode accepting unquoted keys, single quotes, and trailing commas.

A note on Decimal128: BisonDB stores the 16 bytes opaquely and implements full binary-integer-decimal string conversion in both directions (including the clamping and non-canonical-significand rules), but no arithmetic — you can store and display 19.99 exactly; you cannot ask the database to add to it.

BisonDB and Prairie are GPLv3 · educational projects.