LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs

Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi

TL;DR

The paper addresses post-training quantization of large language models under the microscaling (MX) format, where per-block scales impose a block-structured quantization that amplifies the effect of activation outliers. It introduces LATMiX, which learns invertible affine transformations (a global T1 and a per-block T2) parameterized via LU or QR decompositions and optimized with a distillation loss plus a volume-preserving regularizer; the learned transformations are folded into the weight matrices to avoid runtime overhead. A theoretical analysis derives an MX-specific error bound that balances transform conditioning against block-level activation magnitudes, motivating full affine transformations over block-diagonal ones. Empirically, LATMiX yields consistent improvements in low-bit MX quantization across seven zero-shot benchmarks and WikiText2 perplexity, which matters for deploying accurate LLMs under tight resource budgets.
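To make the block-structure point concrete, the following is a minimal sketch of MX-style fake quantization, not the paper's implementation. It assumes a symmetric integer element grid instead of the FP4/FP8 element formats, and the function name `mx_fake_quantize`, the 4-bit width, and the toy input are all illustrative choices. Each block of 32 values shares one power-of-two scale, so a single outlier forces a coarse grid on every other value in its block.

```python
import numpy as np

def mx_fake_quantize(x, block_size=32, bits=4):
    """Simplified MX-style quantize/dequantize (illustrative assumption:
    integer element grid, one power-of-two scale shared per block)."""
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block_size)
    qmax = 2 ** (bits - 1) - 1                      # largest grid level, e.g. 7
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)           # avoid log2(0) on all-zero blocks
    # Smallest power-of-two scale such that amax / scale fits in the grid.
    scale = 2.0 ** np.ceil(np.log2(amax / qmax))
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=64)      # two blocks of 32 well-behaved values
x[3] = 100.0                 # one outlier in the first block
x_hat = mx_fake_quantize(x)

# Per-block mean squared error: the block holding the outlier is much worse.
block_mse = ((x - x_hat) ** 2).reshape(-1, 32).mean(axis=1)
print(block_mse)
```

In this toy setting the first block's scale is set by the outlier, so its small values collapse to zero, while the second block keeps a fine grid. This is the effect that motivates transforming activations before quantization.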

Abstract

Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation or Hadamard-based transformations. Moreover, most studies have focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine the two have shown severe performance degradation, leading prior work to introduce assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized using standard deep learning tools. Experiments show consistent improvements in average accuracy for MX low-bit quantization over strong baselines on a wide range of zero-shot benchmarks, across multiple model sizes.
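As an illustration of why an invertible affine transformation can be applied to activations without changing the network's output, the sketch below folds the inverse transformation into the following linear layer. It is a toy example under stated assumptions, not the authors' code: the LU-style construction (unit lower-triangular `L`, upper-triangular `U` with an exponentiated diagonal), the dimensions, and all variable names are hypothetical. Quantization is left out so that the equivalence is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Hypothetical LU-style parameterization: A = L @ U is invertible by
# construction because L has a unit diagonal and U has a positive diagonal.
L = np.tril(0.1 * rng.normal(size=(d_in, d_in)), k=-1) + np.eye(d_in)
s = 0.1 * rng.normal(size=d_in)                  # log of U's diagonal entries
U = np.triu(0.1 * rng.normal(size=(d_in, d_in)), k=1) + np.diag(np.exp(s))
A = L @ U
v = rng.normal(size=d_in)                        # affine shift

# With this parameterization, log|det A| is simply the sum of s.
logdet = s.sum()

# Original linear layer y = W x + b.
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
x = rng.normal(size=d_in)
y_ref = W @ x + b

# Transform the activation, then fold the inverse into the layer:
# x = A^{-1} (x_t - v)  =>  y = (W A^{-1}) x_t + (b - W A^{-1} v).
x_t = A @ x + v
W_folded = W @ np.linalg.inv(A)
b_folded = b - W_folded @ v
y_folded = W_folded @ x_t + b_folded

print(np.max(np.abs(y_ref - y_folded)))
```

Because the log-determinant is available in closed form under this parameterization, a penalty on it is cheap to evaluate; whether this matches the exact regularizer used in the paper is an assumption here.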

Paper Structure (27 sections, 4 theorems, 35 equations, 4 figures, 11 tables)

Key Result

Theorem 3.3

Assume that $\bm{x}$ is a continuous random vector, $\mathbf{T}$ is an affine transformation, and $Q$ is the quantization of MX as defined in Eq. \ref{eq:q_mx}. Then, under regularity assumptions on $\bm{x}$, Here, $f(x) \lesssim g(x)$ denotes that $f(x)$ is less than $g(x)$ up to a fixed multiplicative constant, and $\norm{\cdot}_{\sigma}$ denotes the spectral norm. Furthermore, if we assume that $\bm{

Figures (4)

  • Figure 1: LATMiX takes into account both the MX block structure and the distribution of features to diffuse outliers. In the figure, energy is distributed both within the block and among blocks to obtain lower quantization error.
  • Figure 2: Analysis of various transformation types: (1) Vanilla: no transformation applied; (2) Hadamard: a full Hadamard transform; (3) Block Hadamard: a block-diagonal matrix in which each block corresponds to an MX block and is a Hadamard matrix; (4) Learned rotation: a learned rotation matrix; and (5) Learned affine: a learned affine transformation that minimizes the objective in Eq. \ref{eq:q_error}. In Fig. \ref{sfig:qe_vs_block_size}, the Hadamard and learned rotation curves are superimposed.
  • Figure 3: LATMiX learns a transformation that spreads the energy across the tensor.
  • Figure 4: Placement of all transformations in a standard LLM, with folding operations marked.
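The difference between the full and block Hadamard transformations listed for Figure 2 can be sketched on a toy vector. The example below is illustrative only: the dimension 8, the block size 4, and the helper names `hadamard` and `block_energy` are assumptions, and the Hadamard matrix is built with the standard Sylvester construction. A single outlier is spread only within its own block by the block-diagonal transform, whereas the full transform spreads its energy across all blocks.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

d, block = 8, 4                      # toy sizes; real MX blocks are larger
x = np.zeros(d)
x[1] = 8.0                           # a single outlier in the first block

H_full = hadamard(d)
# Block-diagonal transform: one small Hadamard matrix per block.
H_block = np.kron(np.eye(d // block), hadamard(block))

def block_energy(z):
    """Sum of squares inside each block."""
    return (z.reshape(-1, block) ** 2).sum(axis=1)

e_vanilla = block_energy(x)              # all energy in block 0
e_block = block_energy(H_block @ x)      # still all in block 0, but spread within it
e_full = block_energy(H_full @ x)        # split evenly between the two blocks

print(e_vanilla, e_block, e_full)
print(np.abs(x).max(), np.abs(H_block @ x).max(), np.abs(H_full @ x).max())
```

Both transforms are orthogonal, so total energy is preserved; only its distribution within and across blocks changes, which is the quantity a per-block scale is sensitive to.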

Theorems & Definitions (9)

  • Definition 3.1: General Affine Transformation
  • Definition 3.2: Transformation Mean Squared Error
  • Theorem 3.3: MX Quantization Error
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • proof
  • Proposition 5.1
  • proof