Beyond Outliers: A Study of Optimizers Under Quantization

Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh

TL;DR

This work investigates how optimizer choice influences model robustness under quantization, covering both post-training quantization (PTQ) and quantization-aware training (QAT) in large language models. By training full-precision models with six optimizers at scales from 50M to 1.5B parameters and then applying PTQ and QAT, the authors show that conventional outlier metrics such as the max-to-mean ratio (MMR) and kurtosis fail to predict PTQ outcomes, and that optimizer rankings in full precision can shift under quantization. They introduce an ABC decomposition framework to analyze error propagation and derive scaling laws indicating that Shampoo provides the best parameter efficiency under QAT. The findings inform optimizer selection for efficient deployment, offer practical guidance for designing quantization-aware training pipelines across model scales, and clarify how optimization dynamics translate into quantization robustness.

Abstract

As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ), and quantization-aware training (QAT). We first train full-precision models, ranging from 50M to 1.5B parameters, with six optimizers, to explore the hyperparameter landscape, and establish well-tuned baselines. We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and Kurtosis, fail to predict the PTQ performance across different optimizers. We show analytically that this is due to the MMR capturing only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study the QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers performing well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.
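To make the two outlier metrics concrete, here is a minimal sketch of how they could be computed row-wise on an activation matrix. It assumes that MMR means the largest absolute value in a row divided by the mean absolute value of that row, and that kurtosis is the non-excess fourth standardized moment; the paper's exact definitions may differ. The function names (`row_mmr`, `row_kurtosis`, `rowwise_average`) and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def row_mmr(row):
    """Max-to-mean ratio of absolute values for one row (assumed definition).

    Assumes the row is not all zeros.
    """
    mags = [abs(v) for v in row]
    return max(mags) / fmean(mags)


def row_kurtosis(row):
    """Non-excess kurtosis: fourth central moment over squared variance.

    Assumes the row is not constant (non-zero variance).
    """
    mu = fmean(row)
    m2 = fmean([(v - mu) ** 2 for v in row])
    m4 = fmean([(v - mu) ** 4 for v in row])
    return m4 / (m2 ** 2)


def rowwise_average(matrix, metric):
    """Apply a per-row metric and average the results over all rows."""
    return fmean([metric(row) for row in matrix])


# Toy activations (illustrative only): one flat row, one row with an outlier.
flat = [1.0, -1.0, 1.0, -1.0]
spiky = [0.1, -0.1, 0.1, 8.0]
activations = [flat, spiky]

avg_mmr = rowwise_average(activations, row_mmr)
avg_kurtosis = rowwise_average(activations, row_kurtosis)
```

Both metrics are larger for the row with a single large entry than for the flat row, which is why they are commonly read as outlier indicators. Each is computed from a single tensor, so neither carries any information about how a quantization error introduced at one layer is transformed by the layers that follow.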

Paper Structure

This paper contains 23 sections, 24 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Accuracy correlation with different metrics for the 760M model. Traditional outlier-sensitive metrics like MMR (Left) and kurtosis (Center) show little to no correlation (measured by $\rho$) with model accuracy, whereas our proposed metric (Right) correlates strongly with the model’s zero-shot performance. MMR and kurtosis are computed row-wise, on the output of the last transformer block.
  • Figure 2: The effect of changing the learning rate $\eta$ on the validation loss and MMR for different optimizers on the 760M model. We average the MMR over the rows of the input tensor of the last linear layer (before the head).
  • Figure 3: ABC decomposition for the 760M models. The x-axis shows the module index $\ell$. This is the only case where we use the truncated average (the average after discarding the top 1% of values) as our summary statistic. $\mathop{\mathrm{Trunc}}\nolimits(R_\ell)$ and $\mathop{\mathrm{Avg}}\nolimits(R_\ell)$ differ non-negligibly only at a single intermediate layer for Shampoo, where a spike in $\mathop{\mathrm{Avg}}\nolimits(R_\ell)$ would otherwise force a log scale.
  • Figure 4: Gain decomposition for the 760M models with different optimizers. The x-axis shows the module index $\ell$.
  • Figure 5: Scaling Laws for each optimizer, for full precision (BF16) and QAT (W4A4). For each optimizer, we report the parameter efficiency $\rho$ of 4-bit QAT in the subplot title. Shampoo has the highest parameter efficiency, followed by AdamW.