TurboQuant is a restricted version of EDEN quantization (NeurIPS 21, ICML 22). It lacks the optimal scale derivations, which makes the TurboQuant variant considerably less accurate than those works. We show this thoroughly in a new note at https://arxiv.org/abs/2604.18555.
We were the first to introduce post-rotation distribution-aware quantization in 2021. It has since been adopted in many fields, including federated learning, vector retrieval, databases, inference engines, and KV-caches.
Some credit for that line of work would be appropriate. It is also baffling to see the name "TurboQuant" attached to this approach, given the many works published from 2021 onwards.
The blog post mentioned above essentially walks you through EDEN quantization but settles on a sub-optimal MSE-minimizing scale plus an unbiasing trick. That trick often costs a full bit more than DRIVE/EDEN need to reach the same accuracy with the unbiasing scale derived in the original 2021 paper.
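To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from either paper) of the two scale choices for the simplest case: a 1-bit sign quantizer applied after a random rotation. The MSE-minimizing scale is the least-squares fit of the rotated vector onto its quantized signs; the unbiasing scale, as used in DRIVE/EDEN, is chosen so the reconstruction is an unbiased estimate of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024
x = rng.normal(size=d)

# Random rotation via QR of a Gaussian matrix (illustrative stand-in
# for the structured rotations used in practice).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = R @ x
q = np.sign(z)  # 1-bit quantization of the rotated vector

# MSE-minimizing scale: least-squares projection of z onto q.
# This minimizes reconstruction error but the estimator is biased.
s_mse = np.dot(z, q) / np.dot(q, q)

# Unbiasing scale (the DRIVE/EDEN-style choice): makes the
# reconstruction an unbiased estimate of x, at slightly higher MSE.
s_unb = np.dot(x, x) / np.dot(z, q)

# Reconstructions: rotate the scaled signs back.
x_hat_mse = R.T @ (s_mse * q)
x_hat_unb = R.T @ (s_unb * q)
```

By Cauchy-Schwarz the unbiasing scale is always at least as large as the MSE-minimizing one; the point of the comment above is that the biased variant plus a separate unbiasing correction wastes bits that the original scale derivation avoids.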
mskkm today at 8:20 AM
The public comments on Openreview now include explicit allegations that the TurboQuant paper knowingly misrepresented RaBitQ and understated RaBitQ’s results. The RaBitQ authors also report in a technical note that several of TurboQuant’s runtime and recall numbers do not reproduce from the released code under the paper’s stated setup. In the note, TurboQuant generally loses to RaBitQ: https://arxiv.org/abs/2604.19528. If these public allegations hold up, then this is not just overhype or sloppy citation practice, but points to a distorted comparison and benchmark claims that do not survive reproduction.
linuxhansl today at 3:22 AM
I am fascinated by this and similar research (RotorQuant, etc.). It seems that by next year we will be able to run this year's largest models on last year's hardware. :)
Maybe we won't need as many data centers and as much power as we thought. Maybe we can run more powerful models locally.
treexs today at 6:49 AM
I feel like I've gotten really good at noticing which model generates what type of site, and this one oozes Codex.
gcr today at 11:33 AM
On TheTom’s llama-cpp fork, TurboQuant makes inference about five to ten times slower than vanilla (M1 Max, qwen3.6-35b-a3b). Seems like the productionization is still a ways away.
jarbus today at 4:18 AM
This is incredible. Interactive demos like this make mathematics 10x more accessible.
nafistiham today at 8:10 AM
Thanks a lot. It gave me a much more detailed view of TurboQuant than the few YouTube videos I watched. Also, the choice of colors is excellent, as it serves both light and dark modes. I'll try to use it in my sites. Kudos!