TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
Yuxiang Chen, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen
TL;DR
TetraJet-v2 tackles the high cost of pre-training large language models with an end-to-end FP4 training method that uses NVFP4 for activations, weights, and gradients in all linear layers. Its key components are an unbiased double-block quantization method for NVFP4 linear layers, OsciReset to suppress weight oscillation, and OutControl to limit the accuracy loss caused by outliers through the Random Hadamard Transform and selective precision retention. Experiments on LLMs with up to 370M parameters and up to 200B training tokens show consistent improvements over prior FP4 training methods and reduce the performance gap to full-precision training by 51.3% on average. By addressing both weight dynamics and outlier handling, this work moves FP4 pre-training closer to practical use, with potential impact on hardware-aware, low-cost training of larger models.
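To make the role of the Random Hadamard Transform concrete, the following is a minimal NumPy sketch under stated assumptions, not the OutControl algorithm itself. The helper names `hadamard` and `random_hadamard`, the matrix sizes, and the injected outlier channel are all hypothetical and chosen for illustration. It shows two generic properties of such a rotation: it is orthogonal, so a matrix product is unchanged when both operands are rotated, and it spreads a single large channel across all channels, which shrinks the peak magnitude a quantizer has to cover.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard(n, rng):
    """Return R = D @ H / sqrt(n), with D a random +/-1 diagonal. R is orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return (signs[:, None] * hadamard(n)) / np.sqrt(n)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # hypothetical activations
X[:, 3] = 50.0                     # inject one outlier channel (illustrative)
W = rng.standard_normal((16, 4))   # hypothetical weights

R = random_hadamard(16, rng)
X_rot = X @ R                      # rotated activations
W_rot = R.T @ W                    # weights rotated by the inverse (transpose)

# The product is preserved because R @ R.T is the identity.
print(np.allclose(X @ W, X_rot @ W_rot))
# The rotation lowers the largest magnitude that must be represented.
print(np.abs(X).max(), np.abs(X_rot).max())
```

In a quantized linear layer, the rotated operands would be quantized instead of the originals; how TetraJet-v2 combines this with selective precision retention is not specified in this summary.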
Abstract
Training Large Language Models (LLMs) is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to preserve the accuracy of outliers. TetraJet-v2 consistently outperforms prior FP4 training methods when pre-training LLMs across model sizes up to 370M parameters and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.
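As a rough illustration of what "unbiased" quantization means here, the sketch below applies per-block scaling and stochastic rounding onto the FP4 E2M1 value grid used by NVFP4 elements. It is a simplified stand-in under stated assumptions and does not reproduce the paper's double-block scheme: the function names are hypothetical, the block scale is kept in full floating point rather than stored in FP8, and no per-tensor scale is applied.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1, the element format of NVFP4.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_to_grid(x, rng):
    """Round each value to one of its two neighbouring grid points so that
    the expected result equals the input (for |x| <= 6)."""
    sign = np.sign(x)
    mag = np.minimum(np.abs(x), E2M1_GRID[-1])
    # Index of the upper neighbour, kept in [1, 7] so a lower neighbour exists.
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, mag, side="left"),
                     1, len(E2M1_GRID) - 1)
    lo = E2M1_GRID[hi_idx - 1]
    hi = E2M1_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)          # probability of rounding up
    round_up = rng.random(x.shape) < p_up
    return sign * np.where(round_up, hi, lo)

def quantize_blocks(x, rng, block=16):
    """Scale each block of `block` values so its max maps to 6, round
    stochastically, and return the dequantized result.
    Assumption: the scale stays in full precision (hypothetical simplification)."""
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0)
    q = stochastic_round_to_grid(blocks / scale, rng)
    return (q * scale).reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
# Averaging many independent quantizations should approach x itself.
mean_q = quantize_blocks(np.tile(x, (4000, 1)), rng).mean(axis=0)
print(np.abs(mean_q - x).max())
```

A single quantization is noisy, but its error has zero mean, which is the property that keeps gradient estimates from drifting systematically; round-to-nearest would give lower per-sample error at the cost of bias.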
