Astro: Activation-guided Structured Regularization for Outlier-Robust LLM Post-Training Quantization

Xi Chen, Ming Li, Junxi Li, Changsheng Li, Peisong Wang, Lizhong Ding, Ye Yuan, Guoren Wang

TL;DR

Astro tackles the accuracy drop that weight and activation outliers cause in weight-only post-training quantization (PTQ) of large language models. It introduces Activation-guided Structured Regularization, which exploits the flat minima of over-parameterized LLMs to reconstruct robust weights without adding inference latency, and it can be combined with standard PTQ methods such as GPTQ. The approach is underpinned by a Flat Minima theorem and an activation-weight coupled error bound, which together guide targeted regularization toward high-magnitude weight outliers in high-activation groups. Empirically, Astro achieves competitive perplexity and downstream task performance across LLaMA-2 model sizes while substantially reducing quantization time and adding no inference latency.

Abstract

Weight-only post-training quantization (PTQ) is crucial for efficient Large Language Model (LLM) deployment but suffers from accuracy degradation caused by weight and activation outliers. Existing mitigation strategies often face critical limitations: they either yield insufficient outlier suppression or incur significant deployment inefficiencies, such as inference latency, heavy preprocessing, or reliance on complex operator fusion. To resolve these limitations, we leverage a key insight: over-parameterized LLMs often converge to Flat Minima, implying a vast equivalent solution space where weights can be adjusted without compromising accuracy. Building on this, we propose Astro, an Activation-guided Structured Regularization framework designed to suppress the negative effects of outliers in a hardware-friendly and efficient manner. Leveraging the activation-guided regularization objective, Astro actively reconstructs intrinsically robust weights, aggressively suppressing weight outliers corresponding to high-magnitude activations without sacrificing model accuracy. Crucially, Astro introduces zero inference latency and is orthogonal to mainstream quantization methods like GPTQ. Extensive experiments show that Astro achieves highly competitive performance; notably, on LLaMA-2-7B, it achieves better performance than complex learning-based rotation methods with almost 1/3 of the quantization time.
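
To make the outlier problem concrete, the following is a minimal sketch of group-wise round-to-nearest quantization. It is not Astro or GPTQ; the function names, the 3-bit setting, the group size of 128, and the synthetic Gaussian weights are all assumptions chosen for illustration. It shows that a single large weight stretches the quantization range of its group and inflates the error of every other weight in that group.

```python
import numpy as np

def quantize_rtn(w, bits=3, group_size=128):
    """Asymmetric round-to-nearest quantization, applied per weight group."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    # The step size is set by the group's full range, so one outlier widens it.
    scale = np.maximum(hi - lo, 1e-12) / levels
    codes = np.clip(np.round((groups - lo) / scale), 0, levels)
    return (codes * scale + lo).reshape(w.shape)

def group_mse(w, bits=3, group_size=128):
    """Mean squared quantization error of each group."""
    err = (quantize_rtn(w, bits, group_size) - w) ** 2
    return err.reshape(-1, group_size).mean(axis=1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1024)   # synthetic weights (assumption)
w_outlier = w.copy()
w_outlier[0] = 1.0                     # inject one outlier into group 0

mse_clean = group_mse(w)
mse_outlier = group_mse(w_outlier)
print("group 0 MSE without outlier:", mse_clean[0])
print("group 0 MSE with outlier:   ", mse_outlier[0])
```

Only group 0 is affected; the other groups keep their original error because each group has its own scale. This locality is why group-level treatment of outliers is a natural fit for group-wise quantization.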

Paper Structure

This paper contains 31 sections, 2 theorems, 37 equations, 3 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.3

Consider a pre-trained LLM parameterized by $\bm{\Theta}_{\text{orig}}$. Under the paper's Hessian assumption, for any loss tolerance $\epsilon > 0$, there exists a convex region $\mathcal{M}_{\epsilon} \subset \mathcal{V}_{\text{flat}}$ centered at $\bm{\Theta}_{\text{orig}}$ with radius $\delta = \sqrt{2\epsilon/\gamma}$. Within this region, for all $\bm{\Theta} \in \mathcal{M}_{\epsilon}$ …
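
The radius $\delta = \sqrt{2\epsilon/\gamma}$ is what a second-order argument would produce. The sketch below is not the paper's proof. It assumes that $\bm{\Theta}_{\text{orig}}$ is a stationary point of the loss $\mathcal{L}$ and that $\gamma$ upper-bounds the curvature of $\mathcal{L}$ along directions in $\mathcal{V}_{\text{flat}}$; the symbols $\mathcal{L}$, $\Delta\bm{\Theta}$, and $\mathbf{H}$ are introduced here for illustration.

$$
\begin{aligned}
\mathcal{L}(\bm{\Theta}_{\text{orig}} + \Delta\bm{\Theta}) - \mathcal{L}(\bm{\Theta}_{\text{orig}})
  &\approx \tfrac{1}{2}\,\Delta\bm{\Theta}^{\top}\mathbf{H}\,\Delta\bm{\Theta}
  \le \tfrac{\gamma}{2}\,\|\Delta\bm{\Theta}\|_2^2, \\
\|\Delta\bm{\Theta}\|_2 \le \delta = \sqrt{2\epsilon/\gamma}
  &\;\Longrightarrow\; \tfrac{\gamma}{2}\,\|\Delta\bm{\Theta}\|_2^2 \le \epsilon .
\end{aligned}
$$

For example, with $\epsilon = 0.01$ and $\gamma = 0.5$ the radius is $\delta = \sqrt{0.04} = 0.2$. A flatter landscape (smaller $\gamma$) gives a larger region in which weights can move within the same loss tolerance.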

Figures (3)

  • Figure 1: Magnitude distributions of weights and activations at one layer in LLaMA-2-7B. Both LLM weights and activations exhibit significant outliers (highlighted in red), which cause severe performance degradation in low-bit quantization settings.
  • Figure 2: Overview of the Astro Framework. Left: Visualization of the loss landscape for LLaMA-2-7B on WikiText-2. Astro exploits the Flat Minima landscape of pre-trained LLMs, searching for quantization-robust weights (Green) that are functionally equivalent to the original (Red). Right: The search is driven by Activation-guided Structured Regularization. By coupling regularization strength $\alpha_k$ with activation magnitude $\|\mathbf{X}^{(k)}\|_F$, Astro adaptively suppresses critical weight outliers in high-activation groups while preserving reconstruction fidelity elsewhere. The reconstructed weights directly replace the original, introducing zero inference latency.
  • Figure 3: Efficiency vs. Performance Trade-off (LLaMA-2-7B, W3A16g128). Comparison of perplexity (PPL $\downarrow$) and total quantization time. "W3A16g128" denotes 3-bit weights, 16-bit activations, and a group size of 128.
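
The coupling between $\alpha_k$ and $\|\mathbf{X}^{(k)}\|_F$ described for Figure 2 can be sketched as follows. This extract does not give the paper's exact normalization or penalty, so the mean-normalized `regularization_strengths` and the max-magnitude `structured_penalty` below are hypothetical choices, as are all function names and the synthetic data. The last lines numerically check a standard inequality (triangle inequality plus submultiplicativity of the Frobenius norm) that shows why weight perturbations in high-activation groups contribute most to output error.

```python
import numpy as np

def group_activation_norms(x, group_size):
    """Frobenius norm ||X^(k)||_F of the activation columns feeding group k.

    x has shape (tokens, in_features); group k covers input channels
    [k * group_size, (k + 1) * group_size).
    """
    tokens, in_features = x.shape
    grouped = x.reshape(tokens, in_features // group_size, group_size)
    return np.sqrt((grouped ** 2).sum(axis=(0, 2)))

def regularization_strengths(x, group_size, base=1.0):
    """Hypothetical coupling: alpha_k proportional to ||X^(k)||_F, mean 'base'."""
    norms = group_activation_norms(x, group_size)
    return base * norms / norms.mean()

def structured_penalty(w, alphas, group_size):
    """Hypothetical penalty: sum_k alpha_k * max |W^(k)| (targets outliers)."""
    in_features, out_features = w.shape
    grouped = w.reshape(in_features // group_size, group_size, out_features)
    return float((alphas * np.abs(grouped).max(axis=(1, 2))).sum())

rng = np.random.default_rng(0)
tokens, in_features, out_features, g = 64, 256, 32, 128
x = rng.normal(size=(tokens, in_features))
x[:, :g] *= 10.0                       # make group 0 the high-activation group
w = rng.normal(scale=0.02, size=(in_features, out_features))

norms = group_activation_norms(x, g)
alphas = regularization_strengths(x, g)
penalty = structured_penalty(w, alphas, g)

# Check: ||X dW||_F <= sum_k ||X^(k)||_F * ||dW^(k)||_F for a random perturbation.
dw = rng.normal(scale=1e-3, size=w.shape)
dw_norms = np.sqrt((dw.reshape(in_features // g, g, out_features) ** 2).sum(axis=(1, 2)))
bound = float((norms * dw_norms).sum())
actual = float(np.linalg.norm(x @ dw))
print("alphas:", alphas, "penalty:", penalty)
print("output error:", actual, "<= bound:", bound)
```

Because the penalty only reshapes the weights before quantization, a scheme of this kind leaves the inference graph unchanged, which is consistent with the zero-latency claim in the caption.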

Theorems & Definitions (7)

  • Definition 4.2: Flat Subspace
  • Theorem 4.3: Existence of Robust Solution Region
  • Remark 4.4
  • Theorem 4.5: Activation-Weight Coupled Error Bound
  • Remark 4.6
  • proof
  • proof