
$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

Pinzheng Wang, Shuli Xu, Juntao Li, Yu Luo, Dong Li, Jianye Hao, Min Zhang

TL;DR

This work introduces Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.

Paper Structure

This paper contains 37 sections, 6 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: (a) Accuracy improvements of DAPO and Re$^2$ on Qwen2.5-7B-Instruct at each training step, with comparable numbers of generated and trained tokens per step. (b) Test-time scaling of DAPO and Re$^2$ under the same training budget, as the number of samples increases.
  • Figure 2: Illustration of reasoning with re-solving. (a) In standard reasoning, when the initial steps are suboptimal, the LLM struggles to reach the correct answer even after generating more reasoning steps and tokens. (b) Reasoning with re-solving, allowing the model to abandon an unpromising path and restart at any point, leads to a higher overall accuracy. (c) Example trace of DAPO and Re$^2$: Both models initially attempt an incorrect approach using the AM–GM inequality; however, Re$^2$ detects the failure, restarts, and arrives at the correct answer.
  • Figure 3: Correlation between CoT length and reasoning performance. The bars represent the proportion of cases within different CoT length intervals, while the line indicates the average accuracy for responses in each interval. The large plots in the top row show the correlation across the entire dataset. The small plots in the second row illustrate the correlation at the level of individual problems.
  • Figure 4: Number of cases where the relative drop exceeds the threshold when continuing from different proportions of incorrect response prefixes, compared to reasoning from scratch. "All Drops" counts all such cases at each prefix proportion, while "First Drops" records the earliest prefix for each response where the drop exceeds the threshold.
  • Figure 5: The framework of Re$^2$. For each query, Re$^2$ samples multiple prefixes, then generates multiple continuations for each prefix. The advantage is calculated within each group, while the out-of-group accuracy is used as the reward for the redo action.
  • ...and 17 more figures
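
The reward scheme described in the Figure 5 caption can be illustrated with a minimal sketch. The caption states only that the advantage is computed within each group and that out-of-group accuracy serves as the reward for the redo action; everything else here is an assumption made for illustration: the `REDO` label, the function names, the 0/1 reward for committed answers, counting redo outcomes in other groups as not correct, and the mean/standard-deviation normalization (a common group-relative choice, not confirmed by the source).

```python
from statistics import mean, pstdev

REDO = "redo"  # hypothetical label for a continuation that chose to restart


def out_of_group_accuracy(groups, g):
    """Fraction of correct outcomes among continuations in every group except g.

    Assumption: a redo in another group counts as "not correct".
    """
    others = [o for i, grp in enumerate(groups) if i != g for o in grp]
    if not others:
        return 0.0
    return sum(o is True for o in others) / len(others)


def rewards_and_advantages(groups, eps=1e-6):
    """Compute per-group rewards and group-normalized advantages.

    groups: one list per sampled prefix; each entry is True (correct answer),
    False (incorrect answer), or REDO (the continuation abandoned the prefix).
    """
    results = []
    for g, grp in enumerate(groups):
        redo_reward = out_of_group_accuracy(groups, g)
        # Assumed rewards: 1.0 / 0.0 for committed answers, out-of-group
        # accuracy for the redo action.
        rewards = [redo_reward if o == REDO else float(o) for o in grp]
        mu, sigma = mean(rewards), pstdev(rewards)
        # Assumed normalization: advantage relative to the group's own mean.
        advantages = [(r - mu) / (sigma + eps) for r in rewards]
        results.append((rewards, advantages))
    return results


if __name__ == "__main__":
    groups = [
        [True, False, REDO],   # mixed prefix
        [False, False, REDO],  # poor prefix: redo should look attractive
        [True, True, True],    # good prefix
    ]
    for rewards, advantages in rewards_and_advantages(groups):
        print(rewards, [round(a, 3) for a in advantages])
```

Under these assumptions, redo receives a positive advantage in a group whose prefix rarely leads to a correct answer while other prefixes for the same query do better, which matches the intuition that restarting is worthwhile only when the current path is less promising than a fresh attempt.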