
AXELRAM: Quantize Once, Never Dequantize

Yasushi Nishida

Abstract

We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution toward N(0, 1/d), so the optimal quantizer depends only on the dimension d and the bit-width b, not on the input data. The asymmetric path design (transform on write, table lookup on read, no inverse transform) reduces per-query multiplications by 102.4×, an exact consequence of orthogonal invariance rather than an approximation. Through multi-seed evaluation (10 seeds × 3 models), we discover that sign-pattern sensitivity causes catastrophic PPL spikes (ΔPPL > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) remain fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a one-time, gradient-free sign-pattern selection (200 candidates, 8 calibration samples) that eliminates catastrophic spikes at zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.
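Because the rotated, norm-stripped key coordinates concentrate toward N(0, 1/d), the quantizer can be frozen at design time from d and b alone. The following is a minimal sketch of how such a fixed Lloyd-Max codebook could be derived; it is illustrative only, not the reference implementation from the repository, and the names lloyd_max_codebook and quantize are hypothetical.

    import numpy as np
    from scipy.stats import norm

    def lloyd_max_codebook(d: int, b: int, iters: int = 200) -> np.ndarray:
        """2**b Lloyd-Max reconstruction levels for a N(0, 1/d) coordinate."""
        sigma = 1.0 / np.sqrt(d)
        # Initialize the levels at quantiles of the target Gaussian.
        levels = sigma * norm.ppf((np.arange(2 ** b) + 0.5) / 2 ** b)
        for _ in range(iters):
            # Decision thresholds sit midway between adjacent levels.
            t = np.concatenate(([-np.inf], (levels[:-1] + levels[1:]) / 2, [np.inf]))
            lo, hi = t[:-1] / sigma, t[1:] / sigma
            mass = norm.cdf(hi) - norm.cdf(lo)
            # Centroid update: conditional mean of N(0, sigma^2) within each cell.
            levels = sigma * (norm.pdf(lo) - norm.pdf(hi)) / mass
        return levels

    def quantize(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
        """Map each coordinate of a rotated, unit-norm key to its nearest level index."""
        return np.abs(x[:, None] - levels[None, :]).argmin(axis=1).astype(np.uint8)

    # The codebook depends only on (d, b), never on data, so it can live in ROM.
    levels_3bit = lloyd_max_codebook(d=128, b=3)   # 8 levels shared by every layer and head

Because nothing in the codebook depends on the model or the input, the same small table can be shared by the write path (comparator quantization) and the read path (table lookup) described in Figure 2.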

Figures (3)

  • Figure 1: Quantize Once, Never Dequantize. Conventional rotation-based quantization (top) must dequantize and inverse-rotate each stored key before computing attention---$T \times O(d\log d)$ per query. AXELRAM (bottom) computes attention entirely within the rotated transform domain: the query is rotated once, and scores are computed via pre-computed codebook products with no dequantization. The orthogonal invariance $\langle \mathbf{q}, \mathbf{k} \rangle = \langle R\mathbf{q}, R\mathbf{k} \rangle$ makes this possible, yielding $102.4\times$ fewer multiplications.
  • Figure 2: AXELRAM smart SRAM macro. Write path (left): norm extraction, FWHT butterfly network (448 add/sub, zero multipliers), Lloyd-Max comparator quantization (896 comparators), writing 3-bit indices + FP16 norm to SRAM. Read path (right): pre-computed table lookup (128 parallel reads), adder tree (127 adders), norm scaling (1 multiplication). Pre-computation (top center, once per query): table generation with 1024 multiplications. The fixed codebook ROM (30 bytes) is shared by both paths across all $d{=}128$ dimensions.
  • Figure 3: Read path detail. Phase 1 (once per query): FWHT rotation of query, table generation ($d \times 2^b$ multiplications). Phase 2 (repeated $T$ times): table lookup, adder tree, norm scaling (1 multiplication per key). Total: 5,120 multiplications for $T{=}4096$ versus 524,288 conventional ($102.4\times$ reduction). A minimal code sketch of this two-phase scoring appears after this list.
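The two-phase read path in Figures 1 and 3 amounts to building one small per-query product table and then scoring every stored key with lookups, additions, and a single norm multiply. The sketch below is illustrative only: the helper names fwht, make_table, and score_keys are assumptions, and a random stand-in codebook replaces the fixed Lloyd-Max ROM. It does, however, reproduce the multiplication count quoted in Figure 3.

    import numpy as np

    def fwht(x: np.ndarray) -> np.ndarray:
        """Normalized fast Walsh-Hadamard transform: add/sub butterflies, no multipliers."""
        y, n, h = x.astype(np.float64), len(x), 1
        while h < n:
            for i in range(0, n, 2 * h):
                a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
                y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
            h *= 2
        return y / np.sqrt(n)

    def make_table(q_rot: np.ndarray, codebook: np.ndarray) -> np.ndarray:
        """Phase 1, once per query: d x 2**b products q_rot[i] * codebook[j]."""
        return q_rot[:, None] * codebook[None, :]              # 128 * 8 = 1,024 mults

    def score_keys(table: np.ndarray, key_idx: np.ndarray, key_norms: np.ndarray) -> np.ndarray:
        """Phase 2, per key: table lookups and an adder tree, then one norm multiply."""
        d = table.shape[0]
        gathered = table[np.arange(d)[None, :], key_idx]       # (T, d) lookups, no mults
        return gathered.sum(axis=1) * key_norms                # 1 mult per key

    # Toy usage with d=128, b=3, T=4096.
    rng = np.random.default_rng(0)
    d, b, T = 128, 3, 4096
    codebook = np.sort(rng.standard_normal(2 ** b)) / np.sqrt(d)   # stand-in for the fixed ROM
    q_rot = fwht(rng.standard_normal(d))                           # query rotated once
    key_idx = rng.integers(0, 2 ** b, size=(T, d))                 # stored 3-bit indices
    key_norms = rng.random(T) + 0.5                                # stored per-key norms
    scores = score_keys(make_table(q_rot, codebook), key_idx, key_norms)

    # Multiplications: 128*8 (table) + 4096 (norm scaling) = 5,120
    # versus 128*4096 = 524,288 for a conventional dot product per key -> 102.4x fewer.

The count matches Figure 3: the only per-key multiplication is the final norm scaling; everything else is an index lookup or an addition, which is what allows the scores to be computed inside the SRAM macro without ever dequantizing the cache.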