Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang

TL;DR

This work investigates whether self-improvement in math reasoning LLMs can be achieved within supervised learning by leveraging negative feedback. It introduces Negative-aware Fine-Tuning (NFT), which constructs an implicit negative policy to learn from negative generations while still training on positive data, enabling direct policy optimization without external teachers. Theoretical analysis reveals NFT and GRPO are equivalent under strict on-policy training, and empirically NFT matches or surpasses leading RL methods like GRPO and DAPO on 7B and 32B Qwen models across multiple benchmarks. The results demonstrate a principled bridge between supervised and reinforcement learning approaches in binary-feedback settings, with negative data, token-level loss, and principled weighting contributing to robust performance improvements in math reasoning tasks.

Abstract

Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM that we aim to optimize on positive data, enabling direct policy optimization on all of the LLM's generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that, by additionally leveraging negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
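
To make the idea of an implicit negative policy concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a single question with four candidate answers, a known generation distribution `pi_old`, and a binary verifier mask `is_correct`. The implicit negative policy is written as `(pi_old - r * pi_pos_theta) / (1 - r)`, which is what follows if the generation policy is a mixture of a positive and a negative policy with mixing weight `r` (the correctness rate). All function names are hypothetical, and the sketch works at the whole-answer level, leaving out the token-level loss and weighting that the TL;DR mentions.

```python
import numpy as np

def implicit_negative_policy(pi_old, pi_pos_theta, r, eps=1e-12):
    """Negative policy implied by the trainable positive policy (toy assumption)."""
    neg = (pi_old - r * pi_pos_theta) / (1.0 - r)
    # Clip so the logarithm below stays finite when the implied mass is <= 0.
    return np.clip(neg, eps, None)

def negative_aware_loss(pi_old, pi_pos_theta, is_correct):
    """Expected negative log-likelihood over answers sampled from pi_old.

    Correct answers are scored under the positive policy itself;
    incorrect answers are scored under the implicit negative policy.
    """
    r = pi_old[is_correct].sum()  # correctness rate for this question
    pi_neg_theta = implicit_negative_policy(pi_old, pi_pos_theta, r)
    pos_term = -(pi_old[is_correct] * np.log(pi_pos_theta[is_correct])).sum()
    neg_term = -(pi_old[~is_correct] * np.log(pi_neg_theta[~is_correct])).sum()
    return pos_term + neg_term

# Toy generation policy over four answers; the first two are correct.
pi_old = np.array([0.1, 0.3, 0.4, 0.2])
is_correct = np.array([True, True, False, False])

r = pi_old[is_correct].sum()
pi_pos_true = np.where(is_correct, pi_old, 0.0) / r  # pi_old restricted to correct answers

loss_at_pos = negative_aware_loss(pi_old, pi_pos_true, is_correct)
loss_at_old = negative_aware_loss(pi_old, pi_old, is_correct)
print(loss_at_pos < loss_at_old)
```

In this toy setting the loss is lower when the trainable policy equals the positive split of `pi_old` than when it stays at `pi_old`, which illustrates how maximum-likelihood training on both positive and negative answers can move the policy toward its correct generations.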

Paper Structure

This paper contains 19 sections, 6 theorems, 44 equations, 11 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Consider the maximum-likelihood objective for training the implicit negative policy $\pi^-_\theta$: Assuming unlimited data and model capacity, the optimal solution for solving Eq. eq:negative_loss is

Figures (11)

  • Figure 1: A spectrum of online algorithms for LLM fine-tuning. NFT bridges reinforcement learning and supervised learning methods through the leverage of negative feedback via supervision.
  • Figure 2: Illustration of the NFT algorithm. Data Collection: An LLM $\pi$ generates answers to a set of math questions. Generation results are split into two sub-datasets based on their answer correctness. Policy Optimization: By constructing an implicit policy for modeling negative data, NFT enables direct policy optimization on both positive and negative answers via maximum-likelihood training.
  • Figure 3: Left: Policy Splitting. The generation policy can be split into a positive policy and a negative policy, and re-expressed as their linear combination. Right: Policy Improvement. By iteratively optimizing towards its positive split, an LLM policy $\pi_0$ can improve continuously.
  • Figure 4: Gradient weight for NFT and GRPO.
  • Figure 5: Comparison of the released NFT-7B with other zero-style math models of Qwen series.
  • ...and 6 more figures
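
The policy splitting described in the Figure 3 caption can be illustrated with a small numerical sketch. It assumes a toy categorical policy over five candidate answers and a hypothetical verifier mask; `split_policy` is an illustrative helper, not a function from the paper. The positive and negative policies are taken to be the generation policy conditioned on the answer being correct or incorrect, and the mixing weight `r` is the probability of a correct answer.

```python
import numpy as np

def split_policy(pi, is_correct):
    """Split a categorical policy into positive and negative parts.

    Returns the correctness rate r and the policy conditioned on
    correct answers (pi_pos) and on incorrect answers (pi_neg).
    """
    r = float(pi[is_correct].sum())
    pi_pos = np.where(is_correct, pi, 0.0) / r
    pi_neg = np.where(is_correct, 0.0, pi) / (1.0 - r)
    return r, pi_pos, pi_neg

# Toy policy over five answers; answers 0 and 3 pass the (hypothetical) verifier.
pi = np.array([0.15, 0.25, 0.30, 0.10, 0.20])
is_correct = np.array([True, False, False, True, False])

r, pi_pos, pi_neg = split_policy(pi, is_correct)

# The generation policy is recovered as a linear combination of the two splits.
recombined = r * pi_pos + (1.0 - r) * pi_neg
print(np.allclose(recombined, pi))
```

Because `pi_pos` places all of its mass on verified answers, moving the policy toward its positive split raises the correctness rate, which is the intuition behind the policy improvement panel of Figure 3.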

Theorems & Definitions (10)

  • Theorem 3.1: Policy Optimization with Negative Answers
  • Proposition 4.1: Algorithm Gradient Comparison
  • Proposition 4.2: On-policy Gradient Equivalence
  • Theorem A.1: Policy Optimization with Negative Answers
  • proof
  • Proposition A.2: Algorithm Gradient Comparison
  • proof
  • Remark A.3: Dr. GRPO
  • Proposition A.4: On-policy Gradient Equivalence
  • proof