TurboQuant: A First-Principles Walkthrough

(arkaung.github.io)

47 points | by kweezar 3 hours ago

5 comments

  • amitport 55 minutes ago
    TurboQuant IS restricted EDEN quantization (NeurIPS 21, ICML 22). It is missing the optimal scale derivations, which makes the TurboQuant variant considerably LESS accurate than those works. We show this thoroughly in a new note at https://arxiv.org/abs/2604.18555 .

    We were the first to introduce post-rotation distribution-aware quantization in 21, which was LATER adopted in many fields including federated learning, vector retrieval, databases, inference engines, and KV-cache compression.

    It would be nice to get some credit for this. And it is certainly baffling to see the name "TurboQuant" used in this context, considering the many works from 21 onwards.

    The blog post above basically walks you through EDEN quantization, but then ends up settling on a less-than-optimal MSE-minimizing version plus an unbiasing trick that often costs a full bit more than DRIVE/EDEN need for the same results (with the unbiased scale shown in the original 21 paper).
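    To make the scale point concrete, here is a tiny 1-bit sketch (my own illustration, not code from either paper): after a random rotation, DRIVE/EDEN send the coordinate signs plus one scalar scale S. The biased MSE-minimizing choice is S = ||Rx||_1/d, while the unbiased choice from the 21 paper is S = ||x||^2/||Rx||_1; freezing S to a constant is exactly what loses accuracy.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 1024
      x = rng.standard_normal(d) * 3.0        # arbitrary input vector

      # Uniform random rotation via QR (an RHT is the fast drop-in).
      R, _ = np.linalg.qr(rng.standard_normal((d, d)))
      z = R @ x                               # rotated coords, ~Gaussian
      s = np.sign(z)                          # 1-bit quantization

      S_mse = np.dot(z, s) / d                # biased, MSE-minimizing: ||z||_1/d
      S_unb = np.dot(x, x) / np.dot(z, s)     # unbiased: ||x||^2/||z||_1

      for name, S in [("biased/MSE", S_mse), ("unbiased", S_unb)]:
          x_hat = R.T @ (S * s)               # decode: rescale, rotate back
          nmse = np.sum((x - x_hat)**2) / np.sum(x**2)
          print(f"{name:10s} S={S:.4f}  NMSE={nmse:.4f}")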

    • KnuthIsGod 0 minutes ago
      https://arxiv.org/abs/2604.18555

      "This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to . EDEN supports both biased and unbiased quantization, each optimized by a different (chosen via methods described in the EDEN works). The fixed choice used by TurboQuant is generally suboptimal, although the optimal for biased EDEN converges to as the dimension grows; accordingly TurboQuant approaches EDEN's behavior for large . Second, TurboQuant combines a biased -bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its -bit step uses the suboptimal ; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased -bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with -bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized ) is more accurate than TurboQuant, and unbiased EDEN is markedly more accurate than TurboQuant, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried."

    • 0xbadcafebee 2 minutes ago
      [delayed]
  • linuxhansl 1 hour ago
    I am fascinated by this and similar research (RotorQuant, etc.). It seems that by next year we will be able to run this year's largest models on last year's hardware. :)

    Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.

    • everythingctl 1 hour ago
      > Maybe we can run more powerful models locally.

      I thought the principal consequence of these KV cache optimisations was letting you run more simultaneous inferences on the same model with the same memory. It doesn’t let you store more model. In some sense that puts local LLM usage at a further disadvantage to inference done in a hyperscaler’s data center.
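
      For scale, a back-of-the-envelope sketch (the model shape below is an assumption, roughly a 7B-class transformer):

        # KV cache per sequence = 2 (K and V) * layers * kv_heads * head_dim
        #                         * context_len * bytes_per_element
        layers, kv_heads, head_dim, ctx = 32, 32, 128, 4096  # assumed shape

        def kv_gib(bits):
            return 2 * layers * kv_heads * head_dim * ctx * (bits / 8) / 2**30

        for bits in (16, 4, 2):
            print(f"{bits:2d}-bit KV cache: {kv_gib(bits):5.2f} GiB/sequence")
        # 16-bit: 2.00 GiB; 2-bit: 0.25 GiB -> ~8x the concurrent sequences

      Same fixed memory, roughly eight times the batch — a throughput win, not room for a bigger model.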

  • jarbus 38 minutes ago
    This is incredible. Interactive demos like this make mathematics 10x more accessible
  • TranspectiveDev 1 hour ago
    [dead]
  • iggerews 1 hour ago
    [dead]