LOTION: Smoothing the Optimization Landscape for Quantized Training

Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade

TL;DR

Quantized training yields a non-differentiable, piecewise-constant loss $\mathcal{L}(\mathrm{cast}(w))$, which complicates gradient-based optimization. Instead of relaxing the gradient computation, LOTION smooths the loss itself: it optimizes the expectation of the quantized loss under unbiased randomized rounding, which yields a differentiable surrogate with guaranteed convergence to a local minimum. With stochastic rounding, the surrogate preserves all global minima of the original quantized problem. The authors derive a Gauss-Newton-style second-order regularizer $\tfrac{1}{2}\operatorname{tr}(G(w)\Sigma_{\varepsilon}(w))$ and show that randomized rounding induces a curvature-aware regularizer that stabilizes training; in its diagonal form the objective is $\mathcal{L}_{\mathrm{GN}}(w) = \mathcal{L}(w) + \tfrac{1}{2}\sum_i g_{ii}(w)\sigma_i^{2}$. Empirically, LOTION improves quantized performance over quantization-aware training (QAT) and post-training quantization (PTQ) on synthetic tasks and scales to 150M- and 300M-parameter language models, where it achieves lower quantized validation loss at INT4, INT8, and FP4 with consistent gains and no extra tuning.
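To make the diagonal form concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a uniform quantization grid whose points are integer multiples of a hypothetical `scale` parameter, and stochastic rounding between the two neighbouring grid points, for which the per-coordinate variance is $\sigma_i^2 = (w_i - \mathrm{lo}_i)(\mathrm{hi}_i - w_i)$. The curvature diagonal `g_diag` is supplied by the caller, and both function names are illustrative.

```python
import numpy as np

def sr_variance(w, scale=1.0):
    """Per-coordinate variance of stochastic rounding on a uniform grid.

    Assumption: grid points are integer multiples of `scale`. A weight w
    between neighbours lo and hi = lo + scale rounds up with probability
    (w - lo) / scale, which gives Var = (w - lo) * (hi - w).
    """
    w = np.asarray(w, dtype=float)
    lo = np.floor(w / scale) * scale
    hi = lo + scale
    return (w - lo) * (hi - w)

def diagonal_gn_objective(loss_value, g_diag, w, scale=1.0):
    """L(w) + 0.5 * sum_i g_ii(w) * sigma_i^2, the diagonal form above."""
    sigma2 = sr_variance(w, scale)
    g_diag = np.asarray(g_diag, dtype=float)
    return loss_value + 0.5 * np.sum(g_diag * sigma2)

# Example: three weights on an integer grid, curvature 2 in every coordinate.
w = np.array([0.25, 1.5, 2.0])
print(sr_variance(w))                               # variance is 0 at w = 2.0
print(diagonal_gn_objective(1.0, [2.0, 2.0, 2.0], w))
```

Under these assumptions the penalty vanishes whenever every weight already sits on a grid point, and it is largest for weights halfway between two grid points.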

Abstract

Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piecewise constant, yielding zero gradients everywhere except at quantization thresholds, where the derivative is undefined. Most existing methods deal with this issue by relaxing gradient computations with techniques like straight-through estimators (STE) and do not provide any guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, Low-precision Optimization via sTochastic-noIse smOothiNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
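The idea of replacing the quantized loss with its expectation under unbiased rounding noise can be illustrated with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's code: the grid is uniform with a hypothetical `scale` parameter, the function names are invented for this example, and the expectation is estimated by plain Monte Carlo sampling.

```python
import numpy as np

def round_to_nearest(w, scale=1.0):
    """Deterministic cast: snap each weight to the closest grid point.

    Note: np.round resolves exact ties by rounding half to even.
    """
    return np.round(np.asarray(w, dtype=float) / scale) * scale

def stochastic_round(w, rng, scale=1.0):
    """Unbiased randomized rounding: the mean of the output equals w."""
    w = np.asarray(w, dtype=float)
    lo = np.floor(w / scale) * scale
    p_up = (w - lo) / scale                  # probability of rounding up
    return lo + scale * (rng.random(w.shape) < p_up)

def smoothed_loss(loss_fn, w, rng, n_samples=10000, scale=1.0):
    """Monte Carlo estimate of E[loss_fn(q)] with q = stochastic_round(w)."""
    w = np.asarray(w, dtype=float)
    samples = [loss_fn(stochastic_round(w, rng, scale)) for _ in range(n_samples)]
    return float(np.mean(samples))

# Example: a one-dimensional quadratic with its optimum off the grid.
def loss(q):
    return float(np.sum((q - 0.3) ** 2))

rng = np.random.default_rng(0)
w = np.array([0.4])
print(loss(round_to_nearest(w)))     # piecewise-constant quantized loss
print(smoothed_loss(loss, w, rng))   # estimate of the smoothed surrogate
```

For this quadratic the exact smoothed value at $w = 0.4$ is $0.6 \cdot 0.09 + 0.4 \cdot 0.49 = 0.25$, so the Monte Carlo estimate should land close to that. The round-to-nearest loss stays flat as `w` moves within a rounding cell, while the smoothed value changes with `w`.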

Paper Structure

This paper contains 42 sections, 4 theorems, 25 equations, 12 figures, 2 tables.

Key Result

Lemma 1

For any loss function $L(w)$ that is continuous with respect to the $L_2$ norm and any $f: \mathbb{R}^d \to \mathbb{P}[Q]$ satisfying the 2nd axiom above, $\mathbb{E}_{q \sim f(w)}[L(q)]$ is also continuous with respect to the $L_2$ norm.
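As a worked illustration of why such an expectation is continuous, consider a special case. The setting is an assumption of this example, not the lemma's general one: a scalar weight $w$, a uniform grid $q_k = ks$ with spacing $s$, and stochastic rounding between the two neighbouring grid points, which rounds up with probability $(w - q_k)/s$.

```latex
\[
\mathbb{E}_{q \sim f(w)}[L(q)]
  = \frac{q_{k+1} - w}{s}\, L(q_k) + \frac{w - q_k}{s}\, L(q_{k+1}),
  \qquad q_k \le w \le q_{k+1}.
\]
```

In this special case the expectation is the linear interpolation of $L$ between adjacent grid points. At $w = q_{k+1}$ the expressions for cell $k$ and cell $k+1$ both evaluate to $L(q_{k+1})$, so the pieces join without a jump, whereas $L(\mathrm{cast}(w))$ itself jumps at the rounding thresholds.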

Figures (12)

  • Figure 1: Quantized validation loss at INT4 precision for LOTION and QAT on a 150M-parameter model. We quantize checkpoints with round-to-nearest (RTN) and randomized rounding (RR) and, for each method, plot the variant that yields the lowest validation loss. The full plot can be found in Figure \ref{fig:extra150m5x}.
  • Figure 2: A comparison of INT4 quantized/rounded validation loss between LOTION, QAT, and PTQ, with summary table. We quantize using round-to-nearest (RTN) and randomized rounding (RR) and report the variant that yields the best performance. Full results are shown in Figure \ref{fig:extrasynthetic_LR_sweep_lin_reg}.
  • Figure 3: Final quantized training loss as a function of the hidden dimension, $k$, of a two-layer linear network for LOTION, QAT, GT, and PTQ. Full results are shown in Figure \ref{fig:extrasynthetic_LR_sweep_Kval}.
  • Figure 4: Quantized validation loss at INT4 (left) and INT8 (right) precision for LOTION, QAT, and PTQ on a 300M-parameter model. Full results are shown in Figure \ref{fig:extra300m}.
  • Figure 5: Quantized validation loss at FP4 precision for LOTION, QAT, and PTQ. Full results are shown in Figure \ref{fig:extrafp4}.
  • ...and 7 more figures

Theorems & Definitions (9)

  • Remark 1
  • Definition 1: Randomized Rounding
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • proof
  • proof
  • proof