Optimizing Large Language Model Training Using FP4 Quantization
Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
TL;DR
The paper tackles the high computational cost of pretraining large language models by proposing an FP4 training framework, an ultra-low-precision scheme designed to preserve accuracy. It introduces a differentiable gradient estimator for more precise weight updates and an outlier clamping and compensation scheme to stabilize activations, achieving performance close to BF16 and FP8 baselines on models of up to 13B parameters trained on up to 100B tokens. Experiments show near-parity in training loss and downstream task performance, and the paper also provides a theoretical speedup analysis and discusses practical overheads. The work further argues that future hardware support for FP4 would unlock substantial energy savings and higher throughput in large-scale AI training.
Abstract
The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
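To make the vector-wise quantization and outlier clamping and compensation ideas from the abstract concrete, here is a minimal NumPy sketch that simulates them in full precision. It is an illustration under stated assumptions, not the authors' implementation: it assumes the E2M1 FP4 format (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), per-row absmax scaling, and a global 0.99 quantile as the clamping threshold. The function names `quantize_fp4_rowwise` and `clamp_outliers` are hypothetical, and the differentiable gradient estimator is not covered.

```python
import numpy as np

# Representable magnitudes of E2M1 FP4 (1 sign, 2 exponent, 1 mantissa bit).
# Assumption: the framework's FP4 format is E2M1.
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_VALUES = np.concatenate([-FP4_MAGNITUDES[:0:-1], FP4_MAGNITUDES])


def quantize_fp4_rowwise(x):
    """Simulated vector-wise FP4 quantization: one absmax scale per row,
    then round every element to the nearest representable FP4 value."""
    absmax = np.max(np.abs(x), axis=1, keepdims=True)
    # Map each row's largest magnitude to the largest FP4 value (6.0).
    scale = np.where(absmax > 0, absmax / FP4_VALUES[-1], 1.0)
    scaled = x / scale
    idx = np.argmin(np.abs(scaled[..., None] - FP4_VALUES), axis=-1)
    return FP4_VALUES[idx] * scale  # dequantized result in the input dtype


def clamp_outliers(x, quantile=0.99):
    """Clamp values outside the [1 - quantile, quantile] range and return
    the clamped tensor plus the sparse residual that was cut off."""
    hi = np.quantile(x, quantile)
    lo = np.quantile(x, 1.0 - quantile)
    clamped = np.clip(x, lo, hi)
    residual = x - clamped  # nonzero only at the outlier positions
    return clamped, residual


# Toy activations: mostly standard normal, with one large outlier per row.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 256))
acts[:, 0] = 50.0

# Direct quantization: the outlier dominates each row's scale.
direct = quantize_fp4_rowwise(acts)

# Clamp first, quantize the clamped tensor, then add the residual back
# in high precision as the compensation term.
clamped, residual = clamp_outliers(acts)
compensated = quantize_fp4_rowwise(clamped) + residual

print("MSE, direct FP4:          ", np.mean((direct - acts) ** 2))
print("MSE, clamp + compensation:", np.mean((compensated - acts) ** 2))
print("nonzero residual entries: ", np.count_nonzero(residual), "of", acts.size)
```

In this toy setting the residual is nonzero at only a small fraction of positions, which is why handling it separately in higher precision can be cheap relative to the main low-precision computation. The quantile threshold is a free parameter here; a lower value clamps more aggressively and makes the residual denser.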
