Defeating the Training-Inference Mismatch via FP16

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

TL;DR

This paper identifies the training–inference mismatch in RL fine-tuning of large language models as a fundamental numerical-precision problem, driven by BF16's reduced mantissa and associated rounding errors. By switching to FP16 for both training and inference, the authors demonstrate dramatic improvements in stability, convergence speed, and final performance across multiple models, datasets, and RL algorithms, effectively eliminating the need for complex correction schemes. The evidence spans offline analyses, sanity tests, and extensive experiments with MoE, LoRA, and large dense models, showing that a simple precision change can improve the effectiveness of policy-gradient methods and reduce deployment gaps. The findings advocate rethinking precision trade-offs in RL fine-tuning and suggest FP16 as a robust, scalable option for reliable RL-based LLM alignment. These results have practical impact by simplifying RL pipelines while delivering stronger performance, and they motivate further exploration of precision strategies beyond BF16 in large-scale RL settings.

Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating-point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks, requires only a few lines of code, and needs no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
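
To make the rounding-error claim concrete, the following minimal sketch compares how coarsely the two 16-bit formats represent the same value. It is an illustration of the number formats only, not the authors' code or experimental setup. The helper names `round_to_bf16` and `round_to_fp16` are hypothetical, the BF16 rounding is emulated in pure Python by rounding a float32 bit pattern to its upper 16 bits (round-to-nearest-even, NaN and overflow handling omitted), and the sample probabilities are arbitrary.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Emulate BF16 rounding (assumption: round-to-nearest-even on float32 bits).

    BF16 keeps the float32 sign and 8 exponent bits but only 7 explicit
    mantissa bits, so we round away the low 16 bits of the float32 pattern.
    NaN and overflow are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def round_to_fp16(x: float) -> float:
    """Round to IEEE 754 half precision (10 explicit mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Arbitrary example token probabilities (illustrative values, not paper data).
for p in (0.7531, 0.0421, 0.9987):
    bf16, fp16 = round_to_bf16(p), round_to_fp16(p)
    print(f"p={p}: bf16={bf16!r} (err {abs(bf16 - p):.2e}), "
          f"fp16={fp16!r} (err {abs(fp16 - p):.2e})")
```

Within FP16's normal range, every BF16 value is also an FP16 value, so FP16 rounding is never worse there. The trade-off is dynamic range: FP16 has 5 exponent bits against BF16's 8, which is why FP16 training commonly relies on loss scaling.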

Paper Structure

This paper contains 34 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Training reward comparison between BF16 and FP16. We evaluate across diverse settings: our Sanity test with various algorithms (GRPO, GSPO, TIS, MIS, PG); different model families (R1D, Qwen and OctoThinker); alternative fine-tuning methods (LoRA); and larger scale models (Dense-14B, MoE). Results are validated on two independent frameworks (VeRL and Oat).
  • Figure 2: FP16 significantly reduces the training-inference mismatch. The left two plots show the token-level probability distribution, and the right two plots present the distribution of sequence-level log probability ratio between the inference policy ($\mu$) and the training policy ($\pi$). Dashed lines in black denote perfect precision without mismatch.
  • Figure 3: Simply switching from BF16 to FP16 stabilizes and prolongs RL training. The basic importance-weighted policy gradient algorithm in FP16 outperforms all baselines in BF16. Note that the third metric reported in each row slightly differs in implementation due to the use of separate codebases (VeRL and Oat). These metrics are semantically similar, and the minor differences do not affect our conclusions.
  • Figure 4: Comparisons between various algorithms based on FP16.
  • Figure 5: Ablation on the precision combinations.
  • ...and 1 more figure