Bijou64: A variable-length integer encoding

(inkandswitch.com)

187 points | by justinweiss 6 hours ago

22 comments

  • kstenerud 5 hours ago
    The problem is that this breaks down once you try to use SIMD instructions. I'd developed a similar kind of approach to encoding integers (and ieee774 floats) a couple of years ago (first byte encodes length and first bit of data: https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b07186... ). It was very clever and used compiler intrinsics to get the length in 1 instruction, so 2 instructions got you the final value, with no branches.

    But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.

    The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

    • nine_k 5 hours ago
      I think these are different use cases. If you talk about SIMD, you talk about the CPU and efficient processing of large numbers of integers. I think that when a solution like this crops up, it's about storage or transmission, and dense packing at the cost of non-uniformity. It's more like time-series databases pack numbers by delta encoding.
      • hansvm 39 minutes ago
        I dunno. Varints in the wild tend to be misused, and there are external proto schemas at work we have to integrate with which would literally be both faster and smaller as gzipped json. They're misused because they have an API encouraging misuse -- compressing scalars rather than sequences. Varints are used because they can have reasonable developer ergonomics while sometimes improving computer metrics a twidge.

        On top of that, for the vast majority of performance/cost parameter spaces, you're better off both in developer ergonomics and speed/space slapping zstd across a flatter binary format, supposing no better tool fits your use case better. Especially if your messages aren't exceptionally tiny. You're not using them in a raw DB or doing raw bulk analysis on varints (else basically zero choices of parameters make varints win out), so you're transferring them somewhere and decoding them. That decoding step, even for highly optimized solutions like bijou64, is on par with (slightly better than, if you have an older datacenter link) your raw network. If you spend 1s on networking, you spend 1s on parsing. That's a bad tradeoff almost always, and that assumes a good varint solution.

        Even when varints make sense for some set of perf/cost parameters, it's still only for developer ergonomics 99.9999% of the time. Even simple changes like operating on a sequence of values rather than a single scalar enable vastly better CPU/space tradeoffs, and being willing to craft a proper data layout usually offers huge gains on top of that.

        It's interesting that you pick delta encoding (or, its natural extension, double-delta encoding often being valuable) for time-series databases as an example. That's an obvious case where you have a solution which is extremely cheap in storage/network/CPU. Varints suck comparatively, almost always.

        Not to rip on them too much, especially since it's nice to have primitives available which let you not have to do hard thinking for literally every problem, but they're not amazing and not a great default.

      • kstenerud 5 hours ago
        The thing is, most real-world numbers will fit within 1-3 bytes (even at 7 bits per byte), so ultradense packing doesn't actually buy much outside of benchmarks.

        I spent WAYYYYYYYY too much time exploring this...

      • the-lazy-guy 4 hours ago
        Stored and transmitted data also has to be decoded. With modern datacenter hardware bottleneck is often CPU rather than network or disk (SSD). It depends on specific properties of the data. (I used to work on search index implementation which is about decoding and intersecting large amounts of hit-lists; and right SIMD-friendly varint encoding is obviously crucial)
    • jaen 4 hours ago
      This doesn't seem particularly hard to SIMD, especially when the CPU architecture has "compress/expand" horizontal instructions. The first byte fully encodes the length, which is not harder than the continuation bits of (U)LEB128. It's a basically a common length-prefixed encoding with an extra subtract added in, so someone has probably figured out an efficient algorithm.

      It might be slightly more instructions than some other serial VL (variable-length) integer codec choices, but overall I don't think it's more difficult.

      The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.

      • kstenerud 4 hours ago
        It can't be done, because the next bytes are dependent upon the first byte (which only works in limited circumstances, and where you have constant spacing between the values).

        ULEB128 works in SIMD because there's only one dependent bit per byte, so you can speculatively decode and then correct later cheaply. Bijou requires you to check the first byte and then branch based on the value using all 8 bits in the decision matrix (to handle branches 0-247, 248, 249, 250, 251, 252, 253, 254, 255). This absolutely DESTROYS any parallelization opportunities.

        Not to mention that non-canonical sized ints (3, 5, 6, 7) have abysmal performance compared to unaligned 2, 4, and 8 byte reads on modern processors.

        • jaen 4 hours ago
          Right, I think we have a slightly different definition of SIMD: You mean byte-parallel, I mean "doable with SIMD instructions". I also didn't imply the performance would be better than other methods...

          Even though decoding the lengths must be serial (since's there's no unambiguous way to differentiate a tag and data byte), it's still doable within the wider SIMD registers, so there's some theoretical efficiency gain to be had (depending on the shape of the data).

          On a general note, the continuation bit and prefix byte forms are equivalent, you just broadcast the prefix byte and compare against an increasing vector to convert it to a mask. Yeah, there's probably more fiddly SIMD if there are multiple prefixes in the register, but doable (it's just not byte-parallel, you eg. unroll the serial decode loop 8 times or whatever your maximum output byte width is, and mask out).

          Simplified:

            // Just maps a byte to its position in the register
            __m128i idx = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
            // Broadcast the prefix
            __m128i nn = _mm_set1_epi8((char)prefix_byte);
            // Get applicable locations: prefix_byte contains the length, if byte_pos < len, the corresponding byte will be set
            __m128i m = _mm_cmpgt_epi8(nn, idx);
            // If you *really* want a high-bit mask:
            m = _mm_and_si128(m, _mm_set1_epi8((char)0x80));
          • kstenerud 3 hours ago
            Yeah, sorry, I didn't say that very well. Single value decoding of Bijou values is of course trivial in SIMD, but the performance benefits of SIMD come from deterministic boundaries across a window. ULEB128's continuation bit is fixed position, so it's data independent. One pmovmskb gives you every boundary in the window.

            Interleaved Bijou has no such signal (tag and payload bytes both span 0x00–0xFF), so finding the boundaries is a dependent per-value walk with no opportunities for parallelism.

            • NooneAtAll3 0 minutes ago
              would vector instruction be of any help? (variable length simd)
            • jaen 3 hours ago
              There's still speculation though - if eg. most values are of 1 or 2-byte length, you can speculate that any control-valued byte is actually control. You can even do a compensation pass to try to fix some amount of mis-speculations, and then bomb out if that fails.

              With that, it's mostly byte-parallel (though data-dependent as I mentioned).

    • itishappy 2 hours ago
      > The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.

      Can you explain this part a bit? I feel like intuitively (and therefore probably incorrectly) these should have the same difficulties.

  • wahern 3 hours ago
    This reminded me of ISO 7816-4 BER-TLV encodings, which uses the format defined in ISO/IEC 8825-1 (ASN.1 related spec). Length integer values of 0-127 are encoded in 1 byte. If the high bit is set, then the first 7 bits tell you the number of subsequent octets. So there's no offsetting involved, making it slightly less compact, but also dead simple.

    EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:

      -- The Yubikey 4 has an off-by-one bug which
      -- declares tag length of 255 (for the 0x53 outer
      -- tag of a certficate DO) when there are only 254
      -- bytes remaining in the reply. The reply is
      -- chained across two packets, but the off-by-one is
      -- probably related to the over-long encoded length
      -- (0x82 0x00 0xff instead of 0x81 0xff).
      --
      -- [snip packet captures]
      --
      -- Yubico's ykpiv_fetch_object function in ykpiv.c
      -- (confirmed 1.4.3-1.5.0) contains a read (memmove)
      -- overflow when the declared inner BER-TLV length
      -- (of the 0x53 tag) is longer than what was
      -- received over the wire. That makes Yubico's
      -- library oblivious to the issue. Relatedly, the
      -- set_length function has an off-by-one bug (length
      -- < 0xff instead of length <= 0xff) which produces
      -- an over-long encoded length. That doesn't by
      -- itself explain why the Yubikey 4 transmits a
      -- truncated logical reply unless the same code is
      -- being used.
  • i2talics 4 hours ago
    Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.

    The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!

    The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.

    • __s 2 hours ago
      I've often done same thing with encoding msgpack maps while streaming in key/value pairs
      • i2talics 1 hour ago
        Neat! It's a useful technique whenever you don't know or want to defer knowing the size of an integer until a later time, but need to allocate space for it up front.

        I'm wary of introducing these forced-canonical encodings by someone hyper focused on "efficiency" and "security" without reconsidering additional use cases.

  • stebalien 5 hours ago
    I've used LEB128 (with canonicalisation) extensively and... this looks so much nicer for most use-cases (length prefixed, supports the full uint64 range without that extra 10th byte).

    The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.

    [1]: https://github.com/multiformats/multicodec

    • b_fiive 5 hours ago
      sup steb, this is expede's work!
      • stebalien 1 minute ago
        Sup b5! I always look forward to new work by expede (and n0).
  • boricj 5 hours ago
    I'm working on a C++ library at work that binds wire data and application data through token and model layers, which includes among other things a fair amount of tokenizers/composers for various formats (JSON, CBOR, BSON, CSV...).

    This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.

    LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.

    I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.

  • omoikane 5 hours ago
    UTF-8 has the same issue ("overlong encoding") where multiple representations are possible the same code point. Someone proposed a similar tweak to remove the overlapping ranges by adjusting the base offset for byte sequences that are longer than 1. That was discussed here:

    https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)

    This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.

  • arkenflame 3 hours ago
    I researched many different varint encodings for a GraphQL-specific binary format (resulting in Argo: https://github.com/msolomon/argo ). I ended up choosing protobuf-style zig-zag varints, but I also found these interesting:

    vu128: https://john-millikin.com/vu128-efficient-variable-length-in...

    metric/imperial varint: https://dcreager.net/2021/03/a-better-varint/

    vectorizing VByte: https://arxiv.org/abs/1503.07387

  • billpg 5 hours ago
    I forget where I encountered it, but I've seen similar encodings that eliminated the possibility of many possible encodings for the same number by making the length part of the value.

    Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.

    10000000 00000000 is the only way to represent 128.

    10000000 10000000 00000000 is the only way to represent 16512.

    Does this encoding have a name?

  • conaclos 4 hours ago
    This is pretty close to SQLite's varints [0]

    [0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki

    • adzm 2 hours ago
      I believe SQLite3 uses a somewhat different implementation:

      > A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are more significant than bits taken from the later bytes.

      from https://www.sqlite.org/fileformat2.html#varint

      The one you linked for SQLite4 (abandoned project) is probably a better approach. I recall that the author has said that SQLite3's varint implementation is regretful.

  • willtemperley 5 hours ago
    Maybe someone can explain why an encoder would ever create the padding bytes allowed in LEB128. I contributed the parser for LEB128 in apple/swift-binary-parsing and I’m still none the wiser. I’m genuinely mystified.
    • scottlamb 5 hours ago
      I can think of two reasons.

      The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.

      The second is that they are implementing a spec that requires appending a varint length-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs say my own code which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...

    • esrauch 5 hours ago
      Let's say you are writing into a byte[] and have a LEB128 length-prefix followed by a payload, but that determining the length actually involves nontrivial encoding work. For example, you have a UTF16 string and want to write out a UTF8 string, you want to go over the characters and write them out, but the UTF8 length is not known without doing all of that work.

      If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.

      But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).

      If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.

    • cornstalks 5 hours ago
      It allows you to fill in padding in a buffer. For example, all data in a buffer will be interpreted by a downstream system, and someone pre-calculated the size of that buffer. Rather than encode everything twice (once to figure out the exact size needed, and a second time to actually populate the buffer) the buffer size was calculated using foreknowledge of how many values would be written to that buffer and then just pessimistically assuming all of them are max-size so writing will never fail. Another situation is when you're rewriting part of an already-encoded file. If you want to change a bit of payload then using padding bytes gives you more flexibility so you can do that without having to do any memcpy into a new buffer.

      It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.

    • i2talics 1 hour ago
      It's useful whenever you don't know the value of an integer but would like to allocate space for it now, and then fill in the value later. Many have mentioned length-prefixed data, which is a good example. Another use case is static linking. I believe LLVM uses this when generating WASM object files.
    • axod 5 hours ago
      Maybe you want to byte align some data, or pack to a certain size but keep compat. I think they're going to be rare cases, but I can see it being used.
    • layer8 5 hours ago
      The issue is that non-unique encodings are an attack vector, because parsers may in practice behave differently for noncanonical (or nominally invalid) encodings.
      • kbolino 5 hours ago
        For example, you have an envelope format that goes: length prefix in LEB128, message, signature. One party controls the length prefix and signature, a different party controls the message. The message-writing party carefully crafts the message so that, in isolation, it appears innocuous, but when wrapped in the envelope, the first few bytes of the message look like continuation of the length prefix. Best case is the receiving party safely fails to parse the message, worst case is the receiving party successfully parses the message, verifies its digital signature, and interprets it differently than the signer did.
    • boricj 5 hours ago
      Laziness probably. Maybe there's an argument if you want to avoid branches and just blast the integer out in a fixed number of statements/instructions/bytes, but that sounds a bit fringe.

      I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.

    • Chaosvex 5 hours ago
      You wouldn't. It's a strange argument that can be countered with, "maybe don't do that?"
      • willtemperley 5 hours ago
        So why does the spec allow it? Like a good engineer I read the spec and tested against the over-wide example encodings given.
        • Chaosvex 5 hours ago
          Because it's not a real standard and there is no blessed RFC for it. The DWARF spec is as close as you'll get and it says, "The integer zero is a special case, consisting of a single zero byte." So in a way, it doesn't.

          Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.

          Edit: to clarify, I was talking about the author's argument being strange, not yours.

          • willtemperley 5 hours ago
            The WASM spec is more explicit about over-long LEB128 encoding.

            Edit: a properly written decoder is a lot more than 10 lines if you properly deal with integer overflow and both signed and unsigned ints.

  • aDyslecticCrow 4 hours ago
    Clever, but one thought crossed my mind;

    An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.

    It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.

    • gregschlom 4 hours ago
      You have the same issue with LEB128 though, right?
      • aDyslecticCrow 4 hours ago
        LEB128 can only trick you by at most one byte, (depending on the followup data). Bijou64 can consistently trick you by 8 bytes.

        In a contrived example of a pbuf {length:int, payload:byte[1]}

        LEB128 can trick you into reading the payload as part of the length, but then hopefully trigger a code check against invalid buffer read. (or one byte outside the struct if the payload is also malicious)

        Binou64 can trick you to read 7 bytes into other memory, before any buffer size validation is done.

        It's then not uncommon to log with a helpful; "buffer with length: 26624894573377(7 bytes of stolen data) is invalid", or just crash.

        It's to the point that Bijou64_decode should perhaps take "end_adress" or "max_read" to catch this kind of attack.

        (If you dont validate a malicious pbuf, you're in for a bad time regardless of integer format, but these int formats add their own way to trigger a buffer overrun despite a proper check.)

      • pixelesque 4 hours ago
        Very likely, but isn't this post claiming that bijou64 is safer than LEB128 for the situation of adversarial varints?
  • nine_k 5 hours ago
    In short: instead of a truly indefinite-length solution with a signal bit on the current byte saying whether to check the next byte, this uses a counter. Values 0x0 to 0xF7 are one-byte integers, 0xF8 to 0xFF use the upper 5 bits as a counter for the number of subsequent bytes. This limits the maximum magnitude to slightly less than 2 ^ 264 (almost all 33-byte values), which seems to be okay for practical computations. The proposed standard limits the supported size to u64 though.

    The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.

    > ...adversarial input, which is rarely in the test suite.

    This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.

    • onlyrealcuzzo 5 hours ago
      > I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.

      They might have a different definition of adversarial than you.

      > My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.

      This doesn't seem like what I would call adversarial.

      This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.

  • harrisi 4 hours ago
    I'm surprised there's no mention in the post or here about SQLite's varint encoding. Not that it would necessarily satisfy the constraints, but it's one of the most used varint implementations.
    • dgllghr 4 hours ago
      It's not terribly fast. It's faster than LEB128 but not as fast as vu128 (at least according to https://github.com/Jiboo/varint_benchmark)
      • harrisi 4 hours ago
        The post says the purpose of exploring this space (which is a fun one) wasn't speed, but representation. The speed gain was an added value.

        I'm not saying SQLite's varint implementation is ideal for every application. It's just an implementation that is one of the most used implementations, if not the most (I'd bet it is by a large margin though). It just seemed like a missed opportunity to compare it with the implementation they landed on.

        EDIT: Just wanted to add, thanks for sharing that link. Interesting!

  • michaelmure 4 hours ago
    One nice upside of having a single way to encode a value is fuzzing: when you work on an encoder/decoder, you can use a fuzzer and do round-trip comparison until you find crashes or inputs/outputs that don't match (and therefore issues in the code). But with LEB128 for example, the fuzzer quickly learn about those alternatives encoding and there is not much you can do from there.
  • apitman 2 hours ago
    This reminds me of the varint encoding used by QUIC, but I've never implemented it. Anyone know the differences?
    • raphlinus 1 hour ago
      Similar, in that it encodes length in the first byte. The differences are:

      * It does not require canonicality, it allows multiple encodings of the same value. To make things even more fun, the QUIC spec requires shortest encoding in some uses but not others.

      * It uses 2 bits rather than cutting out a range.

      * It only encodes values up to 62 bits long.

      So, some similarities but also some differences.

      [1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-le...

  • yread 4 hours ago
    Wouldn't something like this also work: https://en.wikipedia.org/wiki/Elias_omega_coding I've used to great effect for compression
  • HansHamster 5 hours ago
    It feels a bit unfair to say that it is faster by being able to tell the total length from the first byte and capping it at 64 bit, while some of the other formats can store arbitrarily large integers. I guess you could use another variable length encoding for the prefix at the cost of some performance and using even more space...
    • petermcneeley 5 hours ago
      esp when the number is capped at only 64 bits which is quite small for some bigInt style numbers.
  • MarkusQ 3 hours ago
    > This causes problems for signed data

    Given that the context up to this point had been representation of integers, I initially trip on this. :)

  • cantalopes 5 hours ago
    I love the random hyperlink underlines on that page
  • amluto 5 hours ago
    Just a quick reminder:

    > This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.

    If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.

  • alex-reyss 5 hours ago
    [dead]
  • RedShift1 5 hours ago
    This seems quite convoluted just to avoid the "0 can be represented in more than one way" problem.
    • bjoli 5 hours ago
      Having all numbers be valid in only one way is a great idea. So much that I believe webassembly enforced canonical leb128, at the cost of decoding speed.

      And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, many different ways to represent a number becomes a problem.

    • matja 4 hours ago
      > canonicality matters — for signatures, content-addressing, or any kind of “two implementations must agree on the bytes” property

      If you don't do this properly, you end up with things like: - SAML XSW attack due to XML signature wrapping - ASN.1 BER/DER signature forgery - Bitcoin transaction malleability attacks

    • nine_k 5 hours ago
      It allows finding out the length (and allocating memory) after reading the first byte.
    • ape4 5 hours ago
      Comparing a number to zero is something that's done a lot
      • Chaosvex 5 hours ago
        True but also not particularly relevant?
    • ahoka 5 hours ago
      I think it's neat.