Training Dynamics Impact Post-Training Quantization Robustness

Albert Catalan-Tatjer, Niccolò Ajroldi, Jonas Geiping

TL;DR

A comprehensive analysis of quantization degradation across open-source language model training trajectories finds that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters.

Abstract

While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
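For readers unfamiliar with post-training quantization, the sketch below shows one simple way to quantize weights and measure the resulting error. It is an illustration only: it uses symmetric round-to-nearest quantization with a single scale per tensor and a weight-space relative error, both of which are assumptions made here and not necessarily the quantization method or the error metric used in the paper. The function names `quantize_rtn` and `relative_error` are hypothetical.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of a list of floats.

    Assumption for illustration: one scale for the whole tensor,
    chosen so the largest-magnitude weight maps to the largest level.
    Returns the dequantized (float) values.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 3 for 3-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)            # nothing to quantize
    scale = max_abs / qmax
    dequantized = []
    for w in weights:
        q = round(w / scale)            # nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized


def relative_error(weights, dequantized):
    """L2 norm of the quantization residual divided by the L2 norm of the weights."""
    num = sum((w - d) ** 2 for w, d in zip(weights, dequantized))
    den = sum(w ** 2 for w in weights)
    return (num / den) ** 0.5


if __name__ == "__main__":
    w = [0.5, -1.0, 0.25, 0.75]
    for bits in (4, 3):
        print(bits, relative_error(w, quantize_rtn(w, bits)))
```

In practice one would evaluate the quantized model on held-out data rather than compare weights directly; the weight-space error here only conveys the basic idea of rounding to a small set of levels.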

Paper Structure

This paper contains 30 sections, 29 figures.

Figures (29)

  • Figure 1: Evolution of quantization error and validation loss during training of SmolLM3 [bakouch2025smollm3]. We report validation loss for the full-precision weights (panel a) and 3- and 4-bit quantization error (panels b and c) throughout training under both the constant ($\eta = 2 \times 10^{-4}$, up to 10T tokens) and annealing phases of the learning rate schedule (whose evolution is shown as dotted lines). As the learning rate decays, validation loss consistently decreases, whereas quantization error rises sharply and to a much greater extent than at any earlier point in training.
  • Figure 2: Evolution of quantization error and validation loss for the OpenSci-1.3B model [nezhurina_open-sci-ref-001_2025] trained on 1T tokens from Nemotron-cc [su2025nemotroncc]. Quantization degradation increases drastically as the learning rate decays and the model improves, consistent with previously observed patterns.
  • Figure 3: 3-bit quantization error along the training trajectories of OLMo2 models. Error grows gradually during cosine decay but spikes under the steep linear decay phase. Model souping ($\star$) reduces degradation, with the soups achieving lower PTQ error than the individual runs.
  • Figure 4: Validation loss and accuracy degradation follow a similar trend in SmolLM3. Degradation in validation loss (left) and downstream accuracy (right) shows that PTQ effects differ across stages and appear sensitive to post-training interventions. The final model, a weighted average of mid-training and APO, shows better robustness than both individual components.
  • Figure 5: Learning rate decay triggers quantization degradation at different training durations. We use WSD, training a 160M-parameter transformer up to 100B tokens and performing additional cooldowns at 12B, 28B, 46B, 64B, and 82B tokens. Panel (a) shows quantization error during training with different token budgets, and panel (b) the corresponding validation loss. Despite varying the amount of training data, all runs show comparable quantization error after cooldown, highlighting that error spikes are associated with training dynamics rather than token budget.
  • ...and 24 more figures
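
The WSD schedule mentioned in the Figure 5 caption can be sketched as a piecewise function of the training step. This is a minimal sketch under stated assumptions: linear warmup, a constant plateau, and a linear cooldown to zero. The caption does not specify the cooldown shape or the step counts, so the function name `wsd_lr` and all parameter values below are hypothetical.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-stable-decay learning rate at a given step.

    Assumptions for illustration: linear warmup, constant plateau,
    linear decay to zero over the final `decay_steps` steps.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Cooldown: decay linearly from peak_lr to zero.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


if __name__ == "__main__":
    # A cooldown can be branched off the plateau at any point by
    # choosing a smaller total_steps with the same peak_lr.
    for total in (60, 100):
        print(total, [round(wsd_lr(s, total, 1.0, 10, 20), 2) for s in range(0, total + 1, 10)])
```

Because the plateau is constant, runs with different token budgets share the same trajectory up to the point where their cooldown begins, which is what makes it possible to compare cooldowns started at different training durations.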