Visual Essay Series · TurboQuant+

TurboQuant: A Visual Journey Through KV-Cache Compression

From naive quantization to PolarQuant, Walsh-Hadamard rotations, and the surprising lesson that made turbo4 beat q4_0 — told through interactive visualizations.

explore the series
Six modules. One complete picture.
01
Foundation
Why Naive Quantization Fails
KV-cache vectors have a hidden enemy — outliers. A handful of extreme values force your quantization scale to span the entire range, wasting slots on the empty tails while the normal-sized coordinates — the ones that matter most — lose precision.
uniform quantization outliers slot width reconstruction error
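A minimal NumPy sketch of the failure mode (illustrative, not the TurboQuant implementation): one outlier stretches the scale of a symmetric 4-bit uniform quantizer, and the reconstruction error of every normal-sized coordinate explodes.

```python
import numpy as np

def quantize_uniform(v, bits=4):
    """Symmetric uniform quantization: the scale must span max |v|."""
    levels = 2 ** (bits - 1) - 1          # 7 positive slots for 4 bits
    scale = np.abs(v).max() / levels      # a single outlier stretches this
    return np.round(v / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 128)             # well-behaved coordinates
x_out = x.copy()
x_out[0] = 25.0                           # one extreme outlier

err_clean = np.mean((x - quantize_uniform(x)) ** 2)
err_outlier = np.mean((x_out - quantize_uniform(x_out)) ** 2)
print(f"MSE without outlier: {err_clean:.4f}")
print(f"MSE with outlier:    {err_outlier:.4f}")
```

Note the outlier itself is represented fine — the damage lands on the other 127 coordinates, which are now rounded on a much coarser grid.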
02
Core Technique
Random Rotations & Walsh-Hadamard Transform
Rotate the vector before quantizing — outlier energy spreads evenly across all dimensions. After rotation, every coordinate follows the same predictable bell curve N(0, 1/d). The WHT does this in O(d log d).
WHT butterfly N(0, 1/d) random signs norm preservation
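A toy version of the idea (helper names are ours, not the series' rotation utilities): the classic in-place butterfly computes the transform in O(d log d), and combined with random sign flips it smears a worst-case spike into d equal-magnitude coordinates while preserving the norm.

```python
import numpy as np

def wht(v):
    """Iterative Walsh-Hadamard butterfly: O(d log d); len(v) must be a power of 2."""
    v = v.astype(float).copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v / np.sqrt(len(v))            # orthonormal scaling

d = 256
x = np.zeros(d)
x[7] = 1.0                                # all energy in one coordinate
signs = np.random.default_rng(1).choice([-1.0, 1.0], d)
y = wht(signs * x)                        # random signs + WHT = a cheap rotation

print(np.abs(x).max(), np.abs(y).max())   # 1.0 vs 1/sqrt(d): energy spread out
print(np.linalg.norm(x), np.linalg.norm(y))  # both 1.0: rotation preserves norm
```

The random signs matter: they make the rotation data-oblivious, so no adversarial input can re-concentrate its energy after the transform.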
03
Algorithm
PolarQuant — Optimal Scalar Quantization
Now that we know the distribution is always N(0, 1/d), we can design optimal slot positions using Lloyd-Max. More slots near zero, where the data lives. One codebook works for every vector forever.
Lloyd-Max optimal centroids 7× compression norm correction
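Lloyd-Max for a scalar distribution is just 1-D k-means run to convergence. A sketch (illustrative, not the series' codebook code) fits 8 centroids to samples from N(0, 1/d) and shows them bunching near zero:

```python
import numpy as np

def lloyd_max(samples, n_levels=8, iters=30):
    """Alternate nearest-centroid assignment and centroid averaging (1-D k-means)."""
    centroids = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                centroids[k] = samples[idx == k].mean()
    return np.sort(centroids)

d = 128
rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0 / np.sqrt(d), 20_000)  # the post-rotation coordinate law
codebook = lloyd_max(samples)

gaps = np.diff(codebook)
print(np.round(codebook, 4))
print("inner gap:", gaps[3], "outer gap:", gaps[0])  # slots are denser near zero
```

Because the post-rotation distribution is the same for every vector, a codebook like this is computed once and reused — only a per-vector norm correction needs to travel with the codes.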
04
Critical Finding
The QJL Lesson — Why Error Correction Hurts
TurboQuant added a clever 1-bit residual correction called QJL. It reduced average error — yet made attention quality worse. The reason: softmax amplifies variance exponentially. More centroids beats error correction. Every time.
QJL residual bias vs variance softmax amplification turbo4 resurrection
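One way to see the bias/variance asymmetry (a toy illustration of the mechanism, not the QJL analysis itself): softmax is completely invariant to a shared bias on its logits, but zero-mean noise of the same size passes through exp() and survives into the attention weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
logits = rng.normal(0.0, 1.0, 64)
p = softmax(logits)

p_bias = softmax(logits + 0.3)                        # constant bias: cancels exactly
p_noise = softmax(logits + rng.normal(0.0, 0.3, 64))  # same-sized variance: survives

print("bias effect:    ", np.abs(p - p_bias).max())
print("variance effect:", np.abs(p - p_noise).max())
```

This is the trap: a correction that trades a little bias for extra variance can lower the average error on paper while making the post-softmax attention weights worse.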
05
Advanced · Practical
Asymmetric K/V + Advanced Techniques
The most important practical finding: compressing V is essentially free, while K precision is everything. Softmax amplifies K errors exponentially; V errors scale linearly. Then — three orthogonal optimizations that stack for free: Boundary V protects critical layers, Sparse V skips negligible tokens (+22.8% speed), and Block size 128 eliminates redundant norm storage (+12% compression).
asymmetric K/V softmax sensitivity boundary V sparse V block size 128 +22.8% throughput
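A rough way to probe the asymmetry (single-head attention on random data, not the series' benchmark): V enters attention linearly, so doubling a V perturbation exactly doubles the output error, while a K perturbation passes through softmax's exp() and scales nonlinearly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 256, 64
q = rng.normal(0.0, 1.0, d)
K = rng.normal(0.0, 1.0, (n, d))
V = rng.normal(0.0, 1.0, (n, d))
N = rng.normal(0.0, 1.0, (n, d))          # one fixed noise pattern, scaled below

def attend(q, K, V):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

base = attend(q, K, V)

# V path is linear: error scales exactly with the noise.
errV1 = np.linalg.norm(attend(q, K, V + 0.5 * N) - base)
errV2 = np.linalg.norm(attend(q, K, V + 1.0 * N) - base)

# K path goes through exp(): error does NOT scale linearly.
errK1 = np.linalg.norm(attend(q, K + 0.5 * N, V) - base)
errK2 = np.linalg.norm(attend(q, K + 1.0 * N, V) - base)

print("V error ratio (exactly 2):", errV2 / errV1)
print("K error ratio (nonlinear):", errK2 / errK1)
```

The linear V path is what makes aggressive V compression "essentially free"; the nonlinear K path is why K precision dominates attention quality.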
📓 Python Notebooks
Source Code · GitHub
TurboQuant Tutorial — Python Notebooks
All six modules as runnable Jupyter notebooks. Includes the full TurboQuant library, codebook implementations, rotation utilities, and the experiment code behind every chart in this series.
View on GitHub
The conceptual chain — how ideas connect
01
⚠️

The Problem

Outliers stretch the quantization range → slots wasted on tails → normal values lose precision

02
🔄

The Rotation Fix

WHT rotation spreads outlier energy → range collapses → all coordinates follow N(0, 1/d)

03
🎯

Optimal Slots

Known distribution → Lloyd-Max finds optimal centroids → more slots near zero → PolarQuant

04
💡

The QJL Lesson

Residual correction reduces bias but increases variance → softmax amplifies variance → more centroids wins

05
⚖️

Asymmetric Insight

K errors → softmax → exponential damage. V errors → linear scaling. Keep K precise, compress V freely.

06
🚀

Stack Everything

Boundary V + Sparse V + Block 128 → orthogonal optimizations → 3.8–6.4× compression + 22.8% faster

Concept dependency map — how modules build on each other