IsoQuant: Hardware-Aligned SO(4) Isoclinic Rotations for LLM KV Cache Compression

Zhongping Ji

Abstract

Orthogonal feature decorrelation is effective for low-bit online vector quantization, but dense random orthogonal transforms incur prohibitive $O(d^2)$ storage and compute. RotorQuant reduces this cost with blockwise $3$D Clifford rotors, yet the resulting $3$D partition is poorly aligned with modern hardware and offers limited local mixing. We propose \textbf{IsoQuant}, a blockwise rotation framework based on quaternion algebra and the isoclinic decomposition of $SO(4)$. It represents each $4$D block as a quaternion and applies a closed-form transform $T(v)=q_L v \overline{q_R}$. This yields two main variants: \emph{IsoQuant-Full}, which realizes the full $SO(4)$ rotation, and \emph{IsoQuant-Fast}, which keeps only one isoclinic factor for lower cost; the framework also admits a lightweight $2$D special case. At $d=128$, IsoQuant-Full reduces forward rotation cost from about $2{,}408$ FMAs in RotorQuant to $1{,}024$, while IsoQuant-Fast further reduces it to $512$. Across $18$ fused CUDA settings with $d \in \{128,256,512\}$, bit widths $\{2,3,4\}$, and FP16/FP32 execution, IsoQuant achieves mean kernel-level speedups of about $4.5\times$--$4.7\times$ over RotorQuant while maintaining comparable reconstruction MSE, with peak speedups above $6\times$. Current validation is limited to the stage-1 quantize--dequantize path on synthetic normalized vectors; end-to-end KV-cache evaluation remains future work.
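
The FMA counts quoted for $d=128$ are consistent with a simple accounting, sketched here under one assumption that the abstract does not state: each $4$D quaternion (Hamilton) product is charged $16$ FMAs. With $d/4 = 32$ blocks, two products per block for IsoQuant-Full ($q_L v$, then right-multiplication by $\overline{q_R}$), and one product per block for IsoQuant-Fast:

$$
\begin{aligned}
\text{IsoQuant-Full:} \quad & \tfrac{128}{4} \times 2 \times 16 = 1{,}024 \ \text{FMAs}, \\
\text{IsoQuant-Fast:} \quad & \tfrac{128}{4} \times 1 \times 16 = 512 \ \text{FMAs}.
\end{aligned}
$$

Against the roughly $2{,}408$ FMAs quoted for RotorQuant, these are arithmetic reductions of about $2.35\times$ and $4.7\times$ in rotation cost; they are operation counts, not the measured kernel-level speedups.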

Paper Structure

This paper contains 33 sections, 1 theorem, 36 equations, 3 tables, 1 algorithm.

Key Result

Proposition 1

Let $q_L, q_R \in S^3$ be unit quaternions. Then the map $T(v) = q_L v \overline{q_R}$ defines an orthogonal transformation of $\mathbb{R}^4$. Its inverse is $T^{-1}(v) = \overline{q_L} v q_R$, and the pairs $(q_L,q_R)$ and $(-q_L,-q_R)$ induce the same element of $SO(4)$.
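
The three claims of the proposition can be checked numerically on a single $4$D block. The following is a minimal sketch in plain Python, not the paper's CUDA kernel: the helper names (`qmul`, `qconj`, `transform`, `inverse_transform`) and the $(w, x, y, z)$ component order are illustrative assumptions, and the transform is taken to be $T(v) = q_L v \overline{q_R}$ as in the abstract.

```python
import math
import random


def qmul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def qconj(q):
    """Quaternion conjugate; equals the inverse when q has unit norm."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def norm(v):
    """Euclidean norm of a 4-tuple."""
    return math.sqrt(sum(c * c for c in v))


def random_unit_quaternion(rng):
    """Normalize a Gaussian 4-vector to obtain a point on S^3."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = norm(q)
    return tuple(c / n for c in q)


def transform(q_l, q_r, v):
    """T(v) = q_L * v * conj(q_R), treating the 4D block v as a quaternion."""
    return qmul(qmul(q_l, v), qconj(q_r))


def inverse_transform(q_l, q_r, w):
    """T^{-1}(w) = conj(q_L) * w * q_R, valid for unit q_L and q_R."""
    return qmul(qmul(qconj(q_l), w), q_r)


rng = random.Random(0)
q_l = random_unit_quaternion(rng)
q_r = random_unit_quaternion(rng)
v = tuple(rng.gauss(0.0, 1.0) for _ in range(4))

w = transform(q_l, q_r, v)

# Orthogonality implies the Euclidean norm is preserved.
norm_error = abs(norm(w) - norm(v))

# Applying the inverse recovers the original block.
v_back = inverse_transform(q_l, q_r, w)
roundtrip_error = max(abs(a - b) for a, b in zip(v_back, v))

# Negating both quaternions gives the same transform (the two signs cancel).
neg_l = tuple(-c for c in q_l)
neg_r = tuple(-c for c in q_r)
w_neg = transform(neg_l, neg_r, v)
sign_error = max(abs(a - b) for a, b in zip(w_neg, w))
```

All three error values should sit at floating-point round-off. Norm preservation on one vector is a necessary check rather than a proof of orthogonality; the full argument uses the multiplicativity of the quaternion norm, $|q_L v \overline{q_R}| = |q_L|\,|v|\,|q_R| = |v|$.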

Theorems & Definitions (2)

  • Proposition 1
  • proof : Proof sketch