Why Does Stochastic Gradient Descent Slow Down in Low-Precision Training?

Vincent-Daniel Yun

TL;DR

This work analyzes why SGD slows in low-precision training by modeling quantized gradients as $\tilde{g} = q\,g + \varepsilon$, which effectively scales the stepsize to $\mu_q = q_{\min}\,\mu$. Under standard smoothness and (strong) convexity assumptions, the authors derive convergence bounds showing a slower $\mathcal{O}(1/k)$ rate and a higher asymptotic error due to gradient shrinkage and quantization noise, compared to full precision. The theory is corroborated by experiments on FP4/FP8/FP16 gradients, illustrating that increasing the nominal stepsize can recover performance close to higher-precision training. Practically, the results provide guidance for stepsize scheduling in mixed-precision optimization and quantify how precision choices impact convergence speed and accuracy.
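The stepsize-scaling effect can be illustrated with a minimal sketch. The quadratic objective $F(w) = \tfrac{1}{2} w^2$, the values of `q` and `mu`, and the helper name `run_sgd` are illustrative assumptions, not taken from the paper; the noise term is switched off by default so the shrinkage effect is isolated.

```python
import numpy as np

def run_sgd(q, mu, steps=50, w0=1.0, noise_std=0.0, seed=0):
    """Run SGD on F(w) = 0.5 * w**2 with the modeled gradient q * g + eps.

    q, mu, steps, w0 and noise_std are illustrative choices (assumptions).
    """
    rng = np.random.default_rng(seed)
    w = w0
    for _ in range(steps):
        g = w                                   # exact gradient of 0.5 * w**2
        eps = noise_std * rng.standard_normal() # additive quantization-noise stand-in
        g_tilde = q * g + eps                   # shrunken, noisy gradient
        w = w - mu * g_tilde                    # same as a step of size mu * q on g
    return w

full = run_sgd(q=1.0, mu=0.1)       # full-precision baseline
shrunk = run_sgd(q=0.25, mu=0.1)    # same nominal stepsize, shrunken gradient
rescaled = run_sgd(q=0.25, mu=0.4)  # nominal stepsize increased by 1/q

print(f"full={full:.3e}  shrunk={shrunk:.3e}  rescaled={rescaled:.3e}")
```

With the noise off, each step multiplies `w` by `1 - mu * q`, so the shrunken run ends farther from the minimizer at 0, while increasing the nominal stepsize by `1/q` reproduces the full-precision trajectory. Setting `noise_std > 0` adds the noise term from the model on top of this effect.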

Abstract

Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this study, we explore SGD convergence under a gradient shrinkage model, where each stochastic gradient is scaled by a factor \( q_k \in (0,1] \). We show that this shrinkage replaces the usual stepsize \( \mu_k \) with an effective stepsize \( \mu_k q_k \), slowing convergence when \( q_{\min} < 1 \). With typical smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower pace set by \( q_{\min} \), and with a higher steady error level due to quantization effects. We analyze theoretically how lower numerical precision slows training by treating it as gradient shrinkage within the standard SGD convergence setup.
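The effective-stepsize claim follows from a one-line rewrite of the SGD update. As a sketch, ignoring the additive quantization noise, writing $g(w_k,\xi_k)$ for the full-precision stochastic gradient, and taking $q_{\min}$ to be a lower bound on the factors $q_k$ (notation assumed here for illustration):

```latex
\begin{aligned}
w_{k+1} &= w_k - \mu_k \,\tilde{g}(w_k,\xi_k) \\
        &= w_k - \mu_k \, q_k \, g(w_k,\xi_k) \\
        &= w_k - (\mu_k q_k)\, g(w_k,\xi_k),
\qquad \mu_k q_{\min} \le \mu_k q_k \le \mu_k .
\end{aligned}
```

The shrunken-gradient step is therefore an ordinary SGD step with stepsize $\mu_k q_k$, which is never larger than the nominal stepsize $\mu_k$ and, in the worst case, is as small as $\mu_k q_{\min}$.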

Paper Structure

This paper contains 12 sections, 8 theorems, 28 equations, 2 figures.

Key Result

Lemma 1

Under Assumption 1, the iterates of SGD with low-precision gradient $\tilde{g}(w_k,\xi_k)$ satisfy, for all $k \in \mathbb{N}$,

Figures (2)

  • Figure 1: Quantization effect on a slowly decaying gradient-like function $g = e^{-0.2x}$ without AMP or loss scaling.
  • Figure 2: Training curves for ResNet-50 on CIFAR-10 over 100 epochs. (A) Train loss using FP4, FP8, FP16, and FP32 quantized gradients with a stepsize of $1\times10^{-4}$. (B) Test accuracy under the same setting. (C) Train loss using FP4 gradients with an increased stepsize of $5\times10^{-4}$. (D) Test accuracy for FP4 with the larger stepsize.
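The kind of effect described in the Figure 1 caption can be approximated with a short sketch. NumPy has no FP4 or FP8 dtype, so `np.float16` is used here as a stand-in (an assumption); the sampled grid of `x` values is also illustrative, and no loss scaling is applied.

```python
import numpy as np

# Sample the gradient-like function g = exp(-0.2 * x) on an illustrative grid.
x = np.arange(0, 101, 10, dtype=np.float64)
g = np.exp(-0.2 * x)

# Round each value to float16 (stand-in for a low-precision gradient format),
# then cast back to float64 so the comparison itself is done in high precision.
g16 = g.astype(np.float16).astype(np.float64)

# Per-entry ratio of quantized to exact value; it reaches 0 once g underflows.
ratio = g16 / g

for xi, gi, qi, ri in zip(x, g, g16, ratio):
    print(f"x={xi:5.0f}  exact={gi:.3e}  float16={qi:.3e}  ratio={ri:.3f}")
```

For small `x` the float16 values track the exact ones closely, while for large `x` the exact value falls below the smallest representable float16 magnitude and is rounded to zero, so the quantized gradient carries no signal there. Note that round-to-nearest can also round individual values slightly upward, whereas the shrinkage model in the paper takes the factor to lie in $(0,1]$.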

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Theorem 3: Strongly Convex Objective, Fixed Stepsize with Quantization
  • Theorem 4: Strongly Convex Objective, Diminishing Stepsizes with Quantization
  • Remark 5: Convergence Under Quantization