Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding

Taowen Liu, Marta Andronic, Deniz Gündüz, George A. Constantinides

TL;DR

The paper analyzes stochastic rounding (SR) within mixed-precision SGD for LLM training, separating the noise introduced by weight quantization from the noise introduced by per-sample activation and gradient quantization. It develops a theoretical framework showing that the gradient variance from per-sample SR decays as $1/b$ in the batch size $b$, so larger batches can offset reduced precision, while the bias from weight quantization remains as an error floor. Experiments on CIFAR-10 and LLM fine-tuning show that SR outperforms deterministic rounding and that increasing the batch size mitigates SR-induced degradation, which yields actionable guidelines for trading precision for batch size on edge devices. The work also assesses hardware overhead and argues that SR can be implemented at modest resource cost, offering a practical path to edge-friendly training of large models.

Abstract

LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors -- especially batch size -- remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during back-propagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.
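The contrast between deterministic rounding and SR drawn in the abstract can be illustrated with a minimal sketch. It assumes a uniform quantization grid with spacing `step`; the function names `round_to_nearest`, `stochastic_round`, and `mean_sr` are hypothetical and are not taken from the paper.

```python
import math
import random


def round_to_nearest(x: float, step: float) -> float:
    """Deterministic rounding to the nearest multiple of `step`."""
    return math.floor(x / step + 0.5) * step


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to one of its two neighbouring grid points.

    The upper neighbour is chosen with probability equal to the fractional
    position of x inside its grid cell, so the expected output equals x.
    """
    scaled = x / step
    lower = math.floor(scaled)
    prob_up = scaled - lower  # lies in [0, 1)
    return (lower + (rng.random() < prob_up)) * step


def mean_sr(x: float, step: float, trials: int, seed: int = 0) -> float:
    """Average of `trials` independent stochastic roundings of x."""
    rng = random.Random(seed)
    return sum(stochastic_round(x, step, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    x, step = 0.3, 1.0
    print("RTN:", round_to_nearest(x, step))        # always the same grid point
    print("SR mean:", mean_sr(x, step, 20_000))     # close to x on average
```

With `x = 0.3` and `step = 1.0`, round-to-nearest always returns the same grid point and therefore carries a fixed bias, whereas the SR outputs average out close to `x`. Each individual SR output is still a grid point, so the price of removing the bias is extra variance.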

Paper Structure

This paper contains 32 sections, 4 theorems, 46 equations, 7 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Let $\widehat{\boldsymbol{w}}=\mathbb{Q}_{\Delta_W}(\boldsymbol{w},\epsilon_W)$ with quantization step $\Delta_W$. The difference between the QAT gradient $\nabla_{\widehat{\boldsymbol{w}}} L(\widehat{\boldsymbol{w}}, x, y)$ and the true gradient $\nabla L(\boldsymbol{w}, x, y)$ is uniformly bounded by $B_W$, where $B_W =\frac{1}{2} \mathcal{L} \sqrt{d} \Delta_W$ with RTN and $B_W = \mathcal{L} \sqrt{d}\, \Delta \ldots$

Figures (7)

  • Figure 1: Stochastic rounding (SR) achieves higher accuracy with larger batch sizes, while round-to-nearest (RTN) fails to converge at the same precision.
  • Figure 2: The forward pass is in blue, and the backward pass in red. Solid arrows represent data flow, while dashed arrows indicate the flow of quantized values.
  • Figure 3: Stochastic Rounding Mixed-Precision SGD for QAT Objectives: Weight quantization is shared across the forward and backward passes. Activation is kept with high precision in the forward pass.
  • Figure 4: To guarantee the same variance, increase batch size by at most $4\times$ when reducing $1$ bit of precision.
  • Figure 5: Practical experiments with image models. The error bars in (a) and (b) represent the 25th-75th percentiles across independent runs.
  • ...and 2 more figures
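
The $4\times$ factor in the Figure 4 caption is consistent with a standard back-of-the-envelope argument, sketched here under stated assumptions rather than as the paper's own derivation. Assume a uniform quantizer with step $\Delta$, independent SR applied to each of the $b$ samples, and a fixed dynamic range, so that removing one bit roughly doubles the step. The symbols $p$, $e$, and $e_i$ are introduced only for this illustration.

$$
\begin{aligned}
&\text{For a value at fractional position } p \in [0,1) \text{ within its grid cell:} \\
&\qquad e = \begin{cases} (1-p)\,\Delta & \text{with probability } p, \\ -p\,\Delta & \text{with probability } 1-p, \end{cases} \\
&\qquad \mathbb{E}[e] = 0, \qquad \operatorname{Var}[e] = p(1-p)\,\Delta^2 \le \tfrac{1}{4}\Delta^2. \\
&\text{Averaging } b \text{ independent per-sample errors:} \\
&\qquad \operatorname{Var}\!\Big[\tfrac{1}{b}\textstyle\sum_{i=1}^{b} e_i\Big] \le \frac{\Delta^2}{4b}. \\
&\text{With one bit fewer, } \Delta' = 2\Delta, \text{ so the bound is unchanged when} \\
&\qquad \frac{(2\Delta)^2}{4b'} = \frac{\Delta^2}{4b} \iff b' = 4b.
\end{aligned}
$$

Because the argument matches a worst-case bound ($p = 1/2$) rather than the exact variance, a batch size of $4b$ is sufficient but may not be necessary, which fits the "at most" in the caption.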

Theorems & Definitions (14)

  • Definition 1: Threshold Quantization Function
  • Definition 2: Round-to-nearest (RTN)
  • Definition 3: Stochastic Rounding (SR)
  • Definition 4: Gradient Approximation
  • Definition 5: Stochastic Rounding Mini-batch Mixed-precision SGD
  • Lemma 1: Bounded Gradient Bias from Weight Quantization
  • proof
  • Lemma 2: Error Decomposition for Fully Quantized Gradient Component
  • Lemma 3: Quantization Error under Per-Sample SR Scaling and Batch Size
  • Theorem 1: Convergence of SGD with Low-Precision Gradients
  • ...and 4 more
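
The batch-size scaling named in Lemma 3 can be checked numerically with a small simulation. This is a sketch under simplifying assumptions, not the paper's experimental setup: every sample carries the same toy scalar "gradient" `g = 0.3`, so all of the variance comes from rounding, and each sample is rounded independently. The helper names `stochastic_round` and `batch_mean_variance` are hypothetical.

```python
import math
import random
import statistics


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Unbiased rounding of x to a neighbouring multiple of `step`."""
    scaled = x / step
    lower = math.floor(scaled)
    return (lower + (rng.random() < scaled - lower)) * step


def batch_mean_variance(batch_size: int, step: float,
                        trials: int = 4000, seed: int = 0) -> float:
    """Empirical variance of the batch mean of per-sample SR-quantized values."""
    rng = random.Random(seed)
    g = 0.3  # toy per-sample gradient, identical for every sample
    means = [
        sum(stochastic_round(g, step, rng) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


if __name__ == "__main__":
    for b in (1, 4, 16):
        # Variance should shrink roughly in proportion to 1/b.
        print(b, batch_mean_variance(b, step=1.0))
```

In this toy setting the printed variance should fall by roughly a factor of four each time the batch size is multiplied by four. The simulation only exercises the zero-mean rounding noise; a deterministic bias, such as the weight-quantization bias bounded in Lemma 1, would not be reduced by averaging over the batch.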