Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

Yongyu Mu, Jiali Zeng, Fandong Meng, JingBo Zhu, Tong Xiao

Abstract

By encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, supervised fine-tuning (SFT) memorizes new chain-of-thought trajectories and thereby provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute the probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, notably achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points over conventional SFT on Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and its performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.
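The two data pools named in the abstract can be illustrated with a minimal sketch. It assumes that "confidence" is measured by the base policy's perplexity (PPL) on a trajectory, which the figure captions suggest but the abstract does not state; the `Sample` type, the `split_oxa_sets` helper, and the two thresholds are hypothetical names introduced only for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    # Per-token log-probabilities of the trajectory under the base policy (assumed available).
    token_logprobs: List[float]
    # Whether the final answer passed verification.
    is_correct: bool
    # "teacher" for teacher-distilled data, "self" for self-distilled data.
    source: str


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_oxa_sets(
    samples: List[Sample], ppl_high: float, ppl_low: float
) -> Tuple[List[Sample], List[Sample]]:
    """Return (promote, suppress) pools.

    promote:  verified teacher trajectories the policy finds unlikely (high PPL).
    suppress: incorrect self-generated trajectories the policy finds likely (low PPL).
    Both thresholds are hypothetical hyperparameters.
    """
    promote = [
        s for s in samples
        if s.source == "teacher" and s.is_correct
        and perplexity(s.token_logprobs) >= ppl_high
    ]
    suppress = [
        s for s in samples
        if s.source == "self" and not s.is_correct
        and perplexity(s.token_logprobs) <= ppl_low
    ]
    return promote, suppress
```

Under these assumptions, the `promote` pool would feed a likelihood-maximizing objective and the `suppress` pool a likelihood-reducing one; the exact loss functions are not given in this excerpt and are left out of the sketch.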

Paper Structure (56 sections, 9 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Conceptual illustration of entropy collapse versus counteracting entropy collapse.
  • Figure 2: (a) Sequence length and PPL distributions of the 7B model's training data. (b) Performance trajectories of the 7B model fine-tuned with OXA$_{\text{MLE}}$ and OXA$_{\text{Full}}$ on AIME24. (c) Policy entropy dynamics during RLVR training for various fine-tuning methods based on the 7B model. (d) Performance gains of SFT and OXA$_{\text{MLE}}$ on the 1.5B model and the Minerva benchmark, grouped by the base model's pass counts.
  • Figure 3: Pass@$k$ curves of 1.5B and 7B models across different fine-tuning methods on CMIMC25 and HMMT25.
  • Figure 4: (a)-(b) Average performance across 6 mathematical benchmarks for LLaMA3.2-3B and Qwen3-1.7B under various fine-tuning methods. (c) Scalability of SFT and OXA$_{\text{MLE}}$ as training data increases, compared against ARN-1.1-SFT, a model fine-tuned on millions of samples using the same backbone. (d) Generalization performance of 1.5B variants on out-of-domain benchmarks, including GPQA and MMLU-STEM.
  • Figure 5: (a) Results of OXA$_{\text{Full}}$ with Clip-Cov (OXA$_{\text{Full}}$+CC) on AIME25. (b) Comparison across different reasoning data selection strategies on AIME25. "Low PPL", "Random", and "High PPL + Long" correspond to SFT$_{\text{LP}}$, SFT, and OXA$_{\text{MLE}}$, respectively.
  • ...and 1 more figures
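
For readers unfamiliar with the Pass@$k$ metric used in the abstract and in Figure 3, the sketch below shows the widely used unbiased estimator, which computes the probability that at least one of $k$ samples drawn without replacement from $n$ generations is correct, given that $c$ of the $n$ are correct. Whether the paper uses exactly this estimator is an assumption; the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of generations sampled for a problem
    c: number of those generations that are correct
    k: evaluation budget (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect generations, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With $k = 1$ the estimator reduces to the fraction of correct generations, $c / n$, which is the Pass@1 figure; a benchmark-level score is the mean of the per-problem estimates.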