Why Does Stochastic Gradient Descent Slow Down in Low-Precision Training?
Vincent-Daniel Yun
TL;DR
This work analyzes why SGD slows in low-precision training by modeling quantized gradients as \( \tilde{g} = q\,g + \varepsilon \), which effectively scales the stepsize to \( \mu_q = q_{\min}\,\mu \). Under standard smoothness and (strong) convexity assumptions, the authors derive convergence bounds showing a slower \( \mathcal{O}(1/k) \) rate and a higher asymptotic error due to gradient shrinkage and quantization noise, compared to full precision. The theory is corroborated by experiments on FP4/FP8/FP16 gradients, illustrating that increasing the nominal stepsize can recover performance close to higher-precision training. Practically, the results provide guidance for stepsize scheduling in mixed-precision optimization and quantify how precision choices impact convergence speed and accuracy.
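As a short worked step, substituting the quantized-gradient model into the SGD update makes the stepsize rescaling explicit. This is only a sketch in assumed per-iteration notation: the iterate \( x_k \), stochastic gradient \( g_k \), and noise \( \varepsilon_k \) are illustrative names, and \( q_{\min} \) is assumed to lower-bound every \( q_k \).

\[
x_{k+1} = x_k - \mu\,\tilde{g}_k = x_k - \mu\,(q_k g_k + \varepsilon_k) = x_k - (\mu q_k)\,g_k - \mu\,\varepsilon_k .
\]

The gradient term is therefore applied with stepsize \( \mu q_k \geq q_{\min}\,\mu = \mu_q \), while the noise term \( \varepsilon_k \) is still multiplied by the nominal stepsize \( \mu \).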
Abstract
Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this study, we explore SGD convergence under a gradient shrinkage model, where each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence when \( q_{\min} < 1 \). Under typical smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower pace set by \( q_{\min} \) and with a higher steady-state error level due to quantization effects. Overall, the analysis explains theoretically how lower numerical precision slows training, by treating it as gradient shrinkage within the standard SGD convergence framework.
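The effective-stepsize claim can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the authors' experimental setup: the objective is a toy quadratic \( f(x) = \tfrac{1}{2}\|x\|^2 \), the shrinkage factor is held constant, and the function name `shrunk_sgd` and the parameters `noise_std` and `quant_std` are hypothetical.

```python
import numpy as np

def shrunk_sgd(q, mu, steps=200, noise_std=0.1, quant_std=0.0, seed=0):
    """Run SGD on the toy objective f(x) = 0.5 * ||x||^2 using the
    shrunk gradient q * g + eps, and return the loss after each step."""
    rng = np.random.default_rng(seed)
    x = np.ones(10)
    losses = []
    for _ in range(steps):
        # Stochastic gradient of f: the true gradient x plus sampling noise.
        g = x + noise_std * rng.standard_normal(x.shape)
        # Additive quantization noise (zero when quant_std == 0).
        eps = quant_std * rng.standard_normal(x.shape)
        # The update uses the shrunk gradient, so the effective stepsize is mu * q.
        x = x - mu * (q * g + eps)
        losses.append(0.5 * float(x @ x))
    return np.array(losses)

full = shrunk_sgd(q=1.0, mu=0.05)            # no shrinkage
low = shrunk_sgd(q=0.5, mu=0.05)             # shrunk gradients, same nominal stepsize
rescaled = shrunk_sgd(q=0.5, mu=0.05 / 0.5)  # nominal stepsize scaled by 1/q
```

With `quant_std=0.0`, the `rescaled` run follows the same trajectory as `full`, while `low` decreases the loss more slowly. Setting `quant_std` above zero shows the other effect: raising the nominal stepsize also amplifies the additive noise term, so the rescaled run no longer matches the unshrunk one exactly.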
