AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation

Seonggon Kim, Alireza Khodamoradi, Kristof Denolf, Eunhyeok Park

Abstract

Low-precision training (LPT) commonly employs Hadamard transforms to suppress outliers and mitigate quantization error in large language models (LLMs). However, prior methods apply a fixed transform uniformly, despite substantial variation in outlier structures across tensors. Through the first systematic study of outlier patterns across weights, activations, and gradients of LLMs, we show that this strategy is fundamentally flawed: the effectiveness of Hadamard-based suppression depends on how the transform's smoothing direction aligns with the outlier structure of each operand -- a property that varies substantially across layers and computation paths. We classify these patterns into three types: Row-wise, Column-wise, and None. Each pair of operand patterns requires a tailored transform direction or outlier handling strategy to minimize quantization error. Based on this insight, we propose AdaHOP (Adaptive Hadamard transform with Outlier-Pattern-aware strategy), which assigns each matrix multiplication its optimal strategy: Inner Hadamard Transform (IHT) where inner-dimension smoothing is effective, or IHT combined with selective Outlier Extraction (OE) -- routing dominant outliers to a high-precision path -- where it is not. Combined with hardware-aware Triton kernels, AdaHOP achieves BF16 training quality at MXFP4 precision while delivering up to 3.6X memory compression and 1.8X kernel acceleration over BF16 full-precision training.
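The Inner Hadamard Transform at the heart of this pipeline can be illustrated with a few lines of NumPy. The snippet below is a minimal toy sketch, not the paper's Triton kernels: it rotates both GEMM operands along their shared inner dimension with an orthonormal Hadamard matrix, which leaves the exact product unchanged while flattening the rotated operands before quantization. The tensor shapes, the planted outliers, and the peak-to-average ratio used as an outlier measure are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import hadamard

def normalized_hadamard(n):
    # Orthonormal Hadamard matrix (n must be a power of two), so H @ H.T == I.
    return hadamard(n) / np.sqrt(n)

rng = np.random.default_rng(0)
m, k, n = 64, 128, 64
A = rng.normal(size=(m, k))
B = rng.normal(size=(k, n))
A[:, :4] *= 50.0                     # plant column-wise outliers along the inner dim

H = normalized_hadamard(k)
A_rot, B_rot = A @ H, H.T @ B        # Inner Hadamard Transform on the shared k dim

# The exact product is preserved because H is orthogonal ...
assert np.allclose(A @ B, A_rot @ B_rot)

# ... while the rotated operand is far flatter, i.e. easier to quantize.
print(np.abs(A).max() / np.abs(A).mean())          # large: dominated by outliers
print(np.abs(A_rot).max() / np.abs(A_rot).mean())  # much smaller after rotation
```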

Paper Structure

This paper contains 30 sections, 5 theorems, 18 equations, 7 figures, and 5 tables.

Key Result

Proposition 1

For matrix $A \in \mathbb{R}^{m \times k}$ with row outliers, left multiplication reduces the outlier factor by $\gamma(H_m A) \approx \frac{1}{m}\gamma(A)$, while right multiplication leaves it unchanged: $\gamma(A H_k) \approx \gamma(A)$. Symmetrically, for matrix $B \in \mathbb{R}^{k \times n}$ with column outliers, right multiplication reduces the outlier factor by $\gamma(B H_n) \approx \frac{1}{n}\gamma(B)$, while left multiplication leaves it unchanged: $\gamma(H_k B) \approx \gamma(B)$.
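A quick numerical sanity check of this proposition, under assumptions: the snippet below uses a peak-to-average magnitude ratio as a stand-in for the outlier factor $\gamma$ (the paper's exact definition is not reproduced in this summary) and plants three hot rows in an otherwise Gaussian matrix. Left multiplication by an orthonormal Hadamard matrix mixes values across rows and collapses the ratio; right multiplication mixes within each row and leaves it essentially unchanged.

```python
import numpy as np
from scipy.linalg import hadamard

def gamma_proxy(t):
    # Peak-to-average magnitude ratio, used as a stand-in for the outlier
    # factor gamma (the paper's exact definition is not shown above).
    return np.abs(t).max() / np.abs(t).mean()

rng = np.random.default_rng(0)
m, k = 128, 256
A = rng.normal(size=(m, k))
A[:3, :] *= 40.0                  # row-wise outliers: three "hot" rows

H_m = hadamard(m) / np.sqrt(m)    # orthonormal, mixes values across the m rows
H_k = hadamard(k) / np.sqrt(k)    # orthonormal, mixes values across the k columns

print(gamma_proxy(A))             # large, dominated by the hot rows
print(gamma_proxy(H_m @ A))       # sharply reduced by left multiplication
print(gamma_proxy(A @ H_k))       # roughly unchanged by right multiplication
```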

Figures (7)

  • Figure 1: Training loss curves and loss difference relative to BF16 for (Left) Llama3.2-1B and (Right) Instella-3B. AdaHOP consistently achieves the lowest loss gap relative to BF16 among all MXFP4-based methods.
  • Figure 2: (Left) 3D visualization of Weight, Activation, and Gradient tensors from Llama3.2-1B's block.2.self_attn_o after 100 training steps on C4. The Weight tensor shows no pronounced outlier structure, while the Activation exhibits column-wise outliers and the Gradient exhibits row-wise outliers. (Right) 3D visualization of a column-wise outlier tensor after applying different Hadamard transform directions. Right Hadamard (which mixes values within each row, across columns) effectively suppresses column-wise outliers, while Left Hadamard leaves the outlier structure intact.
  • Figure 3: Improvement in quantization error when applying IHT for each outlier pattern pair. (Left) Synthetic outlier tensors (Row/Col kurtosis = 225.95, None kurtosis = 0). (Right) Real tensors from Llama3.2-3B's layers.23.feed_forward.out_proj (Row/Col kurtosis = 197.81, None kurtosis = 6.29). IHT effectively reduces error for CR pairs but is ineffective or harmful for RC, RN, RR, NC, and CC pairs.
  • Figure 4: Outlier patterns of Weight ($W$), Activation ($X$), and Gradient ($G_Y$) tensors across 300 training steps for Llama3.2-3B. Each row represents a tensor from a specific layer, and the color indicates the detected outlier pattern at each step. The patterns remain stable throughout training, enabling one-time calibration. Outlier patterns of the other layers are provided in \ref{sec:OP_figure}.
  • Figure 5: The pipeline of AdaHOP. For each linear layer's three matrix multiplications, AdaHOP selectively applies IHT, OE (Left or Right), or high-precision computation based on the detected outlier pattern pair ($P_n$). Here $n \in \{1, 3, 5\}$ indexes the input tensors. (A toy dispatch sketch follows this figure list.)
  • ...and 2 more figures
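To make the per-GEMM strategy assignment of Figure 5 concrete, here is a hypothetical dispatch sketch. It only encodes the principle stated in the abstract and Figure 3 -- IHT smooths along the shared inner dimension, so it covers column-wise outliers of the left operand and row-wise outliers of the right one, while outliers aligned with an outer dimension need OE on that operand -- and is not the paper's actual rule table (which also includes a full high-precision fallback); the pattern names and the `choose_strategy` helper are illustrative.

```python
# Hypothetical pattern-pair dispatch (illustrative sketch, not the paper's rule
# table). IHT rotates along the shared inner dimension, so it suppresses
# column-wise outliers of the LEFT operand and row-wise outliers of the RIGHT
# operand; outliers aligned with an outer dimension instead get OE, i.e. their
# dominant values are routed to a high-precision side path.

def choose_strategy(left_pattern: str, right_pattern: str) -> dict:
    """Patterns are one of {"row", "col", "none"} per GEMM operand."""
    plan = {"iht": True, "oe_left": False, "oe_right": False}
    if left_pattern == "row":       # outer-dimension outliers on the left operand
        plan["oe_left"] = True
    if right_pattern == "col":      # outer-dimension outliers on the right operand
        plan["oe_right"] = True
    return plan

print(choose_strategy("col", "row"))    # CR pair: IHT alone is sufficient
print(choose_strategy("row", "col"))    # RC pair: IHT plus OE on both operands
print(choose_strategy("none", "col"))   # NC pair: IHT plus OE on the right
```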

Theorems & Definitions (8)

  • Proposition 1: Transform Effectiveness for Outlier Patterns
  • Lemma 1: Hadamard Mixing Direction
  • proof : Proof of \ref{prop:transform_effectiveness}
  • Theorem 1: Error Bound for Row-Column Pattern
  • proof
  • Corollary 1: OHT Improvement Factor
  • Theorem 2: OE Error Bound
  • proof