[first blog in 2026, let's see for how long ~~~]
There's a complaint I keep hearing from engineering teams working with large AI models: "The model is great, but the longer the input, the slower and more expensive it gets."
And here's the thing — that's not a configuration problem. You can't fix it with better hardware tuning or a smarter deployment setup. It's a fundamental architectural constraint, rooted in how transformer-based models store and access information.
To understand why, you need to know what a KV cache is. Every time a transformer model processes a token, it needs to look back at everything it's seen before to decide what's relevant. To do that, it stores the key and value vectors it computed for every past token in what's called a key-value (KV) cache. That cache grows with the number of layers, attention heads, head dimension, and — critically — context length. For large models handling long documents or extended conversations, the KV cache becomes enormous. And enormous memory means high latency, high cost, and hard limits on how many users you can serve at once.
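To get a feel for the scale, here's a back-of-the-envelope calculation in Python. The model shape (32 layers, 32 heads of dimension 128, fp16 values) is a hypothetical example, not any specific model:

```python
# Rough KV-cache size for a hypothetical 32-layer model with 32 attention
# heads of dimension 128, holding a 32k-token context in fp16.
layers, heads, head_dim = 32, 32, 128
context_len = 32_000
bytes_per_value = 2   # fp16
kv = 2                # one key vector + one value vector per token per layer

cache_bytes = kv * layers * heads * head_dim * context_len * bytes_per_value
print(f"{cache_bytes / 1e9:.1f} GB per sequence")  # ~16.8 GB
```

Roughly 17 GB for a single long-context sequence, before you've served a second user.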
The most obvious fix is to compress the cache. In the world of vector quantization, that means converting high-precision floating-point values into a lower-bit representation — instead of 16 bits per value, store 4. Or 2. The data footprint shrinks dramatically.
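As a rough illustration of what "fewer bits" means in practice, here's a minimal uniform quantizer in NumPy. This is the generic textbook scheme, not TurboQuant's codebook-based quantizer:

```python
import numpy as np

def quantize_uniform(x, bits):
    """Round each value to the nearest level on a uniform grid."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)  # the integers you actually store
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

x = np.random.randn(4096).astype(np.float32)
codes, lo, scale = quantize_uniform(x, bits=4)
x_hat = dequantize(codes, lo, scale)
print("max error:", np.abs(x - x_hat).max())  # bounded by scale / 2
```

With 4 bits, each code fits in half a byte (shown as uint8 here for simplicity), so the storage drops from 16 bits per value to 4 — at the price of rounding error.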
But there's a trade-off that's been treated as unavoidable: fewer bits means lost information, and lost information means degraded accuracy. Existing methods were stuck between two bad options. Offline methods produce excellent compression, but they require retraining or calibration data, which is a non-starter when the vectors you need to compress are generated on the fly during inference. Online methods work immediately without any calibration, but they're mathematically suboptimal: there's always a gap between what they achieve and the theoretical best possible compression. TurboQuant was built to close that gap: online, no training, no calibration, with distortion that provably approaches the lowest achievable by any quantization method.
The key insight is surprisingly simple: rotate the vectors randomly before quantizing them. I know that sounds almost too easy, but stay with me.
A random rotation doesn't change the information content of a vector — the distance between any two vectors remains exactly the same after rotation. What changes is how the values are spread across dimensions. In an unrotated vector, that spread is uneven: a few dimensions carry outlier values far from the mean, while others carry almost nothing. Apply uniform quantization to that, and you destroy precision exactly where it matters most.
After random rotation, the distribution changes fundamentally. The TurboQuant team formally proved that each coordinate of a rotated high-dimensional vector follows a Beta distribution — and in the high dimensions typical of modern AI models, this converges toward a near-Gaussian with very small variance. The dimensions become nearly independent of each other, and when dimensions are independent and their distribution is known, you can quantize each one optimally without worrying about cross-dimension interactions.
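A small NumPy experiment makes the effect visible. The rotation here is a dense orthogonal matrix from a QR decomposition; TurboQuant uses faster structured rotations, but the geometric effect is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# A "spiky" vector: a handful of coordinates dominate, the rest are zero.
x = np.zeros(d)
x[:8] = rng.normal(scale=10.0, size=8)

# Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
x_rot = Q @ x

print(np.linalg.norm(x), np.linalg.norm(x_rot))  # identical: rotation preserves geometry
print(np.abs(x).max(), np.abs(x_rot).max())      # no single coordinate dominates anymore
```

The norms match exactly, while the largest coordinate of the rotated vector drops sharply: the energy has been spread evenly across all dimensions, which is exactly the regime where per-coordinate quantization works best.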
There's one more piece to the puzzle. Quantization that minimizes reconstruction error doesn't automatically give you accurate inner products — the dot products that are literally how attention mechanisms measure relevance between tokens. If quantization introduces systematic bias here, your model degrades in subtle but serious ways. TurboQuant handles this with a two-stage approach: first, quantization optimized for minimum reconstruction error using the random rotation plus precomputed codebooks; second, the residual gets re-quantized using a 1-bit method based on the quantized Johnson-Lindenstrauss transform, which is mathematically proven to produce unbiased inner product estimation. The result is a single system that's optimal for both goals — two things that were previously considered incompatible without training.
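To make the two-stage structure concrete, here's a deliberately simplified sketch. Stage one is a plain 4-bit uniform quantizer standing in for the paper's optimal codebooks, and stage two is a 1-bit sign-and-scale pass on the residual standing in for the QJL-based quantizer; the real construction and its unbiasedness guarantee are more involved:

```python
import numpy as np

def two_stage_quantize(x, bits=4):
    """Toy two-stage scheme: coarse uniform quantization, then a 1-bit sign
    pass on what's left over. Structural idea only, not the paper's exact method."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (levels - 1)
    coarse = np.round((x - lo) / scale) * scale + lo  # stage 1 reconstruction

    residual = x - coarse
    alpha = np.abs(residual).mean()                   # one scalar per vector
    fine = alpha * np.sign(residual)                  # stage 2: 1 bit per coordinate

    return coarse, coarse + fine

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
one_stage, two_stage = two_stage_quantize(x)
print("1-stage error:", np.linalg.norm(x - one_stage))
print("2-stage error:", np.linalg.norm(x - two_stage))
```

Even in this toy version, spending one extra bit per coordinate on the residual cuts the reconstruction error roughly in half.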
In benchmarks on language model KV caches, TurboQuant is quality-neutral at 3.5 bits per channel: model performance doesn't measurably drop. Push down to 2.5 bits and the degradation stays marginal. Overall, that's more than 4x compression compared to the full-precision representation.
For engineers building real systems, this matters in a very practical way. The gap between "impressive benchmark" and "deployable in production" is almost never about model accuracy — it's about operational cost. How much memory does it need? What's the latency? Can it run on the hardware you already have? A model with a KV cache that's 4x smaller can be deployed on cheaper cloud instances, on edge devices that previously couldn't handle it, or serve 4x more users on the same infrastructure. TurboQuant moves the line between "feasible" and "not feasible" — without touching the model itself.
The breakthrough here wasn't more data, bigger models, or fancier hardware. It came from asking a different question: is the trade-off between speed and accuracy actually inevitable — or does it only look inevitable because of how we've been framing the problem? Turns out, it was the framing. One random rotation before compression, and the trade-off disappears.
That's a useful reminder far outside the world of quantization. A lot of constraints that feel fundamental turn out to be artifacts of the angle you've been looking from. Change the angle, and the wall isn't there anymore.
The paper is on arXiv, number 2504.19874. It was written by Amir Zandieh of Google Research, Majid Daliri of New York University, Majid Hadian of Google DeepMind, and Vahab Mirrokni of Google Research.
30 March 2026
Potato Codex
----------------
This episode is available on Spotify in Bahasa Indonesia. For other courses, ebooks, source code, or ways to connect, visit → linktr.ee/potatocodex