Why Does Stochastic Gradient Descent Slow Down in Low-Precision Training?

Vincent-Daniel Yun

TL;DR

This work analyzes why SGD slows in low-precision training by modeling quantized gradients as \( \tilde{g} = q\,g + \varepsilon \), which effectively scales the stepsize to \( \mu_q = q_{\min}\,\mu \). Under standard smoothness and (strong) convexity assumptions, the authors derive convergence bounds showing a slower \( \mathcal{O}(1/k) \) rate and a higher asymptotic error due to gradient shrinkage and quantization noise, compared to full precision. The theory is corroborated by experiments on FP4/FP8/FP16 gradients, illustrating that increasing the nominal stepsize can recover performance close to higher-precision training. Practically, the results provide guidance for stepsize scheduling in mixed-precision optimization and quantify how precision choices impact convergence speed and accuracy.
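
To see where the effective stepsize comes from, it may help to substitute the quantized gradient into a single plain SGD step. The lines below are only a sketch: the iterate \( x_k \), the per-step index \( k \), and the constant nominal stepsize \( \mu \) are notational assumptions made here, not statements taken from the paper's theorems.

```latex
\[
x_{k+1} \;=\; x_k - \mu\,\tilde{g}_k
        \;=\; x_k - \mu\,\bigl(q\,g_k + \varepsilon_k\bigr)
        \;=\; x_k - (\mu q)\,g_k \;-\; \mu\,\varepsilon_k .
\]
```

Under these assumptions the true gradient \( g_k \) is weighted by \( \mu q \) rather than \( \mu \), so in the worst case the step behaves like one taken with \( q_{\min}\,\mu \), while the \( \mu\,\varepsilon_k \) term remains as additional noise.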

Abstract

Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this study, we explore SGD convergence under a gradient shrinkage model, where each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence when \( q_{\min} < 1 \). With typical smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower pace set by \( q_{\min} \), and with a higher steady-state error level due to quantization effects. We analyze theoretically how lower numerical precision slows training by treating it as gradient shrinkage within the standard SGD convergence setup.
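
The shrinkage effect can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: the one-dimensional quadratic objective, the constant shrinkage factor `q`, the function name `shrunk_sgd`, and all parameter values are assumptions chosen for illustration, and the additive quantization noise term is left out so that only the shrinkage is visible.

```python
import random


def shrunk_sgd(q, mu, steps=50, x0=1.0, curvature=1.0, noise_std=0.0, seed=0):
    """Run SGD on f(x) = 0.5 * curvature * x**2 with gradients scaled by q.

    Returns |x| after `steps` updates, i.e. the distance to the minimizer x = 0.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        # Stochastic gradient: exact gradient plus optional Gaussian noise.
        g = curvature * x + rng.gauss(0.0, noise_std)
        # Gradient shrinkage model: the update only sees q * g.
        g_tilde = q * g
        # The nominal stepsize is mu, but the step actually taken is (mu * q) * g.
        x = x - mu * g_tilde
    return abs(x)


full = shrunk_sgd(q=1.0, mu=0.1)       # no shrinkage
low = shrunk_sgd(q=0.25, mu=0.1)       # shrunk gradients, same nominal stepsize
rescaled = shrunk_sgd(q=0.25, mu=0.4)  # nominal stepsize raised by 1/q
print(full, low, rescaled)
```

In this noise-free toy setting the run with `q=0.25` ends farther from the minimizer than the unshrunk run, and raising the nominal stepsize by `1/q` brings the iterates back in line with it, because the product `mu * q` is what drives the update. Setting `noise_std` above zero adds stochastic gradient noise for experimentation.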
