Bijou64: A variable-length integer encoding
125 points - today at 3:03 PM
SourceComments
But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!
The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.
It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesnβt confuse you.