Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding
Taowen Liu, Marta Andronic, Deniz Gündüz, George A. Constantinides
TL;DR
The paper analyzes stochastic rounding (SR) in mixed-precision SGD for LLM training, separating the noise introduced by weight quantization from the noise introduced by per-sample activation and gradient quantization. Its theoretical framework shows that the gradient variance contributed by per-sample SR decays as $1/b$ with batch size $b$, so larger batches can offset reduced precision, whereas weight-quantization bias persists as an error floor. Experiments on CIFAR-10 and LLM fine-tuning confirm that SR outperforms deterministic rounding and that increasing the batch size mitigates SR-induced degradation, yielding practical guidelines for trading precision for batch size on edge devices. The work also assesses hardware overhead, arguing that SR can be implemented at modest resource cost, which makes training large models on edge devices more practical.
Abstract
LLM training is resource-intensive. Quantized training improves computational and memory efficiency but introduces quantization noise, which can hinder convergence and degrade model accuracy. Stochastic Rounding (SR) has emerged as a theoretically attractive alternative to deterministic rounding, offering unbiased gradient estimates. However, its interaction with other training factors, especially batch size, remains underexplored. In this paper, we present a theoretical and empirical study of mini-batch stochastic gradient descent (SGD) with SR, showing that increased batch sizes can compensate for reduced precision during back-propagation. Furthermore, we show that quantizing weights and activations impacts gradient variance in distinct ways. Our experiments validate these theoretical insights.
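
The two properties the abstract relies on, unbiasedness of SR and the shrinking of its noise under batch averaging, can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes a uniform quantization grid with spacing `step` and independent rounding noise per sample, and the names `stochastic_round`, `nearest_round`, and `batch_mean_variance` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def stochastic_round(x, step, rng):
    """Round x onto a uniform grid of spacing `step` (assumed grid, not the paper's format).

    Each value rounds up with probability equal to its fractional distance
    from the lower grid point, so the expected result equals x.
    """
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    frac = scaled - lower
    round_up = rng.random(scaled.shape) < frac
    return (lower + round_up) * step


def nearest_round(x, step):
    """Deterministic round-to-nearest on the same grid, for comparison."""
    return np.round(np.asarray(x, dtype=np.float64) / step) * step


def batch_mean_variance(value, step, batch_size, trials, rng):
    """Empirical variance of the mean of `batch_size` independently SR-rounded copies of `value`."""
    samples = np.full((trials, batch_size), value, dtype=np.float64)
    means = stochastic_round(samples, step, rng).mean(axis=1)
    return means.var()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    value, step = 0.3, 1.0  # a value that lies between two grid points

    sr_mean = stochastic_round(np.full(200_000, value), step, rng).mean()
    print("round-to-nearest result:", nearest_round(value, step))
    print("mean of SR results:     ", sr_mean)

    for b in (1, 4, 16, 64):
        var = batch_mean_variance(value, step, b, 20_000, rng)
        print(f"batch size {b:3d}: variance of batch mean = {var:.5f}")
```

In this toy setting, round-to-nearest always maps the value to the same grid point, so its error does not average out, while the SR results average close to the original value and the variance of the batch mean falls roughly in proportion to $1/b$. How this carries over to real per-sample activation and gradient quantization depends on the paper's analysis, which the sketch does not reproduce.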
