Defeating the Training-Inference Mismatch via FP16
Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
TL;DR
This paper traces the training–inference mismatch in RL fine-tuning of large language models to numerical precision: BF16's short mantissa introduces rounding errors that make the training and inference policies diverge. By switching to FP16 for both training and inference, the authors report substantial improvements in stability, convergence speed, and final performance across multiple models, datasets, and RL algorithms, largely removing the need for complex correction schemes. The evidence spans offline analyses, sanity tests, and extensive experiments with MoE, LoRA, and large dense models, showing that a simple precision change can make policy-gradient methods more effective and reduce deployment gaps. The findings argue for rethinking precision trade-offs in RL fine-tuning and suggest FP16 as a robust, scalable option for RL-based LLM alignment. In practice, the change simplifies RL pipelines while delivering stronger performance, and it motivates further exploration of precision strategies beyond BF16 in large-scale RL settings.
Abstract
Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating-point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks, requires only a few lines of code, and needs no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
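To make the rounding-error argument concrete, the following minimal sketch compares how FP16 (10 explicit mantissa bits, 5 exponent bits) and BF16 (7 explicit mantissa bits, 8 exponent bits) round the same values. It is an illustration only and not the authors' code: the helper names `round_to_fp16`, `round_to_bf16`, and `max_relative_error` are hypothetical, BF16 is simulated by rounding the float32 bit pattern to its top 16 bits, and the inputs are assumed to be finite and within FP16's range.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Simulated by keeping the top 16 bits of the float32 encoding.
    Assumption: x is finite; this is an illustrative helper, not a library API.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 bits of the encoding
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def max_relative_error(values, rounder) -> float:
    """Largest relative rounding error of `rounder` over nonzero `values`."""
    return max(abs(rounder(v) - v) / abs(v) for v in values)


if __name__ == "__main__":
    # 1 + 2**-9 needs 9 mantissa bits: FP16 stores it exactly,
    # while BF16 rounds it to 1.0.
    x = 1 + 2 ** -9
    print("FP16 exact:", round_to_fp16(x) == x)
    print("BF16 value:", round_to_bf16(x))

    # Probability-like values in [0.1, 1.0], the range where token
    # probabilities from the two policies would be compared.
    probs = [0.1 + 0.9 * i / 999 for i in range(1000)]
    print("max rel. error FP16:", max_relative_error(probs, round_to_fp16))
    print("max rel. error BF16:", max_relative_error(probs, round_to_bf16))
```

For values in the normal range, the relative rounding error is bounded by 2^-11 in FP16 and by 2^-8 in BF16, so each BF16 rounding step can be up to roughly eight times coarser. The sketch does not model how such errors accumulate through a full forward pass, nor FP16's narrower dynamic range.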
