Visual Essay Series · TurboQuant+

TurboQuant: A Visual Journey Through KV-Cache Compression

From naive quantization to PolarQuant, Walsh-Hadamard rotations, and the surprising lesson that made turbo4 beat q4_0 — told through interactive visualizations.

explore the series
Six modules. One complete picture.
01
Foundation
Why Naive Quantization Fails
KV-cache vectors have a hidden enemy — outliers. A handful of extreme values force your quantization scale to span the entire range, wasting slots on the empty tails while the normal-sized coordinates — the ones that matter most — lose precision.
uniform quantization outliers slot width reconstruction error
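A minimal NumPy sketch of the failure mode (illustrative, not the TurboQuant implementation): one outlier stretches the scale of a symmetric 4-bit uniform quantizer, and the reconstruction error of every normal-sized coordinate explodes.

```python
import numpy as np

def quantize_uniform(v, bits=4):
    """Symmetric uniform quantization: the scale must span max |v|."""
    levels = 2 ** (bits - 1) - 1          # 7 positive slots for 4 bits
    scale = np.abs(v).max() / levels      # a single outlier stretches this
    return np.round(v / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 128)             # well-behaved coordinates
x_out = x.copy()
x_out[0] = 25.0                           # one extreme outlier

err_clean = np.mean((x - quantize_uniform(x)) ** 2)
err_outlier = np.mean((x_out - quantize_uniform(x_out)) ** 2)
print(f"MSE without outlier: {err_clean:.4f}")
print(f"MSE with outlier:    {err_outlier:.4f}")
```

Note the outlier itself is represented fine — the damage lands on the other 127 coordinates, which are now rounded on a much coarser grid.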
02
Core Technique
Random Rotations & Walsh-Hadamard Transform
Rotate the vector before quantizing — outlier energy spreads evenly across all dimensions. After rotation, every coordinate follows the same predictable bell curve N(0, 1/d). The WHT does this in O(d log d).
WHT butterfly N(0, 1/d) random signs norm preservation
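A toy version of the idea (helper names are ours, not the series' rotation utilities): the classic in-place butterfly computes the transform in O(d log d), and combined with random sign flips it smears a worst-case spike into d equal-magnitude coordinates while preserving the norm.

```python
import numpy as np

def wht(v):
    """Iterative Walsh-Hadamard butterfly: O(d log d); len(v) must be a power of 2."""
    v = v.astype(float).copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v / np.sqrt(len(v))            # orthonormal scaling

d = 256
x = np.zeros(d)
x[7] = 1.0                                # all energy in one coordinate
signs = np.random.default_rng(1).choice([-1.0, 1.0], d)
y = wht(signs * x)                        # random signs + WHT = a cheap rotation

print(np.abs(x).max(), np.abs(y).max())   # 1.0 vs 1/sqrt(d): energy spread out
print(np.linalg.norm(x), np.linalg.norm(y))  # both 1.0: rotation preserves norm
```

The random signs matter: they make the rotation data-oblivious, so no adversarial input can re-concentrate its energy after the transform.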
03
Algorithm
PolarQuant — Optimal Scalar Quantization
Now that we know the distribution is always N(0, 1/d), we can design optimal slot positions using Lloyd-Max. More slots near zero, where the data lives. One codebook works for every vector forever.
Lloyd-Max optimal centroids 7× compression norm correction
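Lloyd-Max for a scalar distribution is just 1-D k-means run to convergence. A sketch (illustrative, not the series' codebook code) fits 8 centroids to samples from N(0, 1/d) and shows them bunching near zero:

```python
import numpy as np

def lloyd_max(samples, n_levels=8, iters=30):
    """Alternate nearest-centroid assignment and centroid averaging (1-D k-means)."""
    centroids = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                centroids[k] = samples[idx == k].mean()
    return np.sort(centroids)

d = 128
rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0 / np.sqrt(d), 20_000)  # the post-rotation coordinate law
codebook = lloyd_max(samples)

gaps = np.diff(codebook)
print(np.round(codebook, 4))
print("inner gap:", gaps[3], "outer gap:", gaps[0])  # slots are denser near zero
```

Because the post-rotation distribution is the same for every vector, a codebook like this is computed once and reused — only a per-vector norm correction needs to travel with the codes.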
04
Critical Finding
The QJL Lesson — Why Error Correction Hurts
TurboQuant added a clever 1-bit residual correction called QJL. It reduced average error — yet made attention quality worse. The reason: softmax amplifies variance exponentially. More centroids beats error correction. Every time.
QJL residual bias vs variance softmax amplification turbo4 resurrection
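One way to see the bias/variance asymmetry (a toy illustration of the mechanism, not the QJL analysis itself): softmax is completely invariant to a shared bias on its logits, but zero-mean noise of the same size passes through exp() and survives into the attention weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
logits = rng.normal(0.0, 1.0, 64)
p = softmax(logits)

p_bias = softmax(logits + 0.3)                        # constant bias: cancels exactly
p_noise = softmax(logits + rng.normal(0.0, 0.3, 64))  # same-sized variance: survives

print("bias effect:    ", np.abs(p - p_bias).max())
print("variance effect:", np.abs(p - p_noise).max())
```

This is the trap: a correction that trades a little bias for extra variance can lower the average error on paper while making the post-softmax attention weights worse.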
05
Advanced · Practical
Asymmetric K/V + Advanced Techniques
The most important practical finding: compressing V is essentially free, while K precision is everything. Softmax amplifies K errors exponentially; V errors scale linearly. Then — three orthogonal optimizations that stack for free: Boundary V protects critical layers, Sparse V skips negligible tokens (+22.8% speed), and Block size 128 eliminates redundant norm storage (+12% compression).
asymmetric K/V softmax sensitivity boundary V sparse V block size 128 +22.8% throughput
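A rough way to probe the asymmetry (single-head attention on random data, not the series' benchmark): V enters attention linearly, so doubling a V perturbation exactly doubles the output error, while a K perturbation passes through softmax's exp() and scales nonlinearly.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 256, 64
q = rng.normal(0.0, 1.0, d)
K = rng.normal(0.0, 1.0, (n, d))
V = rng.normal(0.0, 1.0, (n, d))
N = rng.normal(0.0, 1.0, (n, d))          # one fixed noise pattern, scaled below

def attend(q, K, V):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

base = attend(q, K, V)

# V path is linear: error scales exactly with the noise.
errV1 = np.linalg.norm(attend(q, K, V + 0.5 * N) - base)
errV2 = np.linalg.norm(attend(q, K, V + 1.0 * N) - base)

# K path goes through exp(): error does NOT scale linearly.
errK1 = np.linalg.norm(attend(q, K + 0.5 * N, V) - base)
errK2 = np.linalg.norm(attend(q, K + 1.0 * N, V) - base)

print("V error ratio (exactly 2):", errV2 / errV1)
print("K error ratio (nonlinear):", errK2 / errK1)
```

The linear V path is what makes aggressive V compression "essentially free"; the nonlinear K path is why K precision dominates attention quality.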
📓 Python Notebooks
Source Code · GitHub
TurboQuant Tutorial — Python Notebooks
All six modules as runnable Jupyter notebooks. Includes the full TurboQuant library, codebook implementations, rotation utilities, and the experiment code behind every chart in this series.
View on GitHub
The conceptual chain — how ideas connect
01
⚠️

The Problem

Outliers stretch the quantization range → slots wasted on tails → normal values lose precision

02
🔄

The Rotation Fix

WHT rotation spreads outlier energy → range collapses → all coordinates follow N(0, 1/d)

03
🎯

Optimal Slots

Known distribution → Lloyd-Max finds optimal centroids → more slots near zero → PolarQuant

04
💡

The QJL Lesson

Residual correction reduces bias but increases variance → softmax amplifies variance → more centroids wins

05
⚖️

Asymmetric Insight

K errors → softmax → exponential damage. V errors → linear scaling. Keep K precise, compress V freely.

06
🚀

Stack Everything

Boundary V + Sparse V + Block 128 → orthogonal optimizations → 3.8–6.4× compression + 22.8% faster

Concept dependency map — how modules build on each other