Learning from Failures in Multi-Attempt Reinforcement Learning

Stephen Chung, Wenyu Du, Jie Fu

TL;DR

The paper addresses inefficiencies in single-turn reinforcement learning for reasoning tasks by introducing a multi-turn, multi-attempt RL framework for large language models. It trains a small LLM (1.5B) using PPO on a math dataset, allowing up to $M=5$ attempts per question, with the number of allowed attempts $N$ randomized per question and a reward computed against the ground-truth answer. The multi-attempt model refines its answers more effectively: its math-benchmark accuracy rises from $45.6\%$ (1 attempt) to $52.5\%$ (2 attempts) and up to $53.82\%$ with more attempts, while the single-turn baseline improves only marginally, from $42.3\%$ to $43.2\%$. Across five benchmarks, the multi-attempt model shows a modest gain in single-attempt accuracy and a clear improvement in response refinement. The findings suggest multi-turn RL can foster self-refinement and emergent capabilities like the Aha Moment, enabling more adaptive reasoning and learning from feedback. Code is available for replication.

Abstract

Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt
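The multi-attempt task described above can be sketched as a short episode loop. This is a minimal illustration rather than the authors' implementation: the function name `run_multi_attempt_episode`, the `policy` callable, the prompt and feedback wording, the exact-match answer check, and the reward values (1.0 for a correct answer, 0.0 otherwise) are all assumptions made for the example.

```python
import random
from typing import Callable, List, Optional, Tuple


def run_multi_attempt_episode(
    question: str,
    ground_truth: str,
    policy: Callable[[List[str]], str],  # hypothetical: maps the dialogue so far to an answer
    max_attempts: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], float, int]:
    """Run one question with a randomly sampled attempt budget.

    Returns the dialogue, the episode reward, and the number of attempts used.
    """
    rng = rng or random.Random()
    # Sample how many attempts are allowed for this question (1..max_attempts).
    allowed = rng.randint(1, max_attempts)
    dialogue = [f"{question} You have {allowed} attempt(s)."]

    for used in range(1, allowed + 1):
        answer = policy(dialogue)
        dialogue.append(answer)
        # Assumed check: exact string match against the ground-truth answer.
        if answer.strip() == ground_truth.strip():
            return dialogue, 1.0, used  # assumed reward for a correct answer
        remaining = allowed - used
        if remaining > 0:
            # Feedback is given only after an incorrect response.
            dialogue.append(f"Incorrect. You have {remaining} attempt(s) left.")

    return dialogue, 0.0, allowed  # assumed reward when every attempt fails
```

In an actual training setup, `policy` would be the LLM being fine-tuned and the returned reward would feed the RL update; here it is any callable, which keeps the sketch runnable on its own.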

Paper Structure

This paper contains 5 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Evaluation accuracy as a function of the number of allowed attempts during evaluation, averaged across five benchmarks: AIME 2024, MATH 500, AMC 2023, Minerva Math, and OlympiadBench. Both LLMs are based on Qwen 2.5 Math 1.5B and fine-tuned via RL on a small math dataset in either multi-attempt tasks or single-turn tasks (baseline).
  • Figure 2: Illustration of the multi-attempt question-answer task. We extend the single-turn question-answer task from DeepSeek R1 to a multi-attempt setting, enabling iterative refinement.
  • Figure 3: An example of a multi-attempt dialogue ($N=2$) from a fine-tuned LLM, where the LLM makes a mistake on the first attempt but learns to correct it in the second attempt.
  • Figure 4: Training and evaluation performance of the LLMs. (a) Training reward as a function of training steps. (b) Average evaluation accuracy across five benchmarks as a function of training steps, evaluated under the standard single-attempt setting.
  • Figure 5: Evaluation accuracy as a function of the number of allowed attempts during evaluation on individual benchmarks.
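The accuracy-versus-attempts curves in Figures 1 and 5 can be approximated from per-question logs. The sketch below assumes a hypothetical log format in which each question records the 1-based attempt on which it was first answered correctly (or `None` if never), and assumes that one run at the largest budget is representative of smaller budgets. The paper's evaluation may instead run the model separately for each budget, so treat this as an illustration only.

```python
from typing import List, Optional


def accuracy_by_attempt_budget(
    first_correct: List[Optional[int]], max_budget: int
) -> List[float]:
    """Return accuracy for each attempt budget k = 1..max_budget.

    first_correct[i] is the 1-based attempt on which question i was first
    answered correctly, or None if it was never answered correctly.
    """
    total = len(first_correct)
    if total == 0:
        raise ValueError("first_correct must contain at least one question")
    curve = []
    for k in range(1, max_budget + 1):
        # A question counts as solved under budget k if it was solved by attempt k.
        solved = sum(1 for a in first_correct if a is not None and a <= k)
        curve.append(solved / total)
    return curve


# Four questions: solved on attempt 1, attempt 2, never, and attempt 3.
print(accuracy_by_attempt_budget([1, 2, None, 3], max_budget=3))
# prints [0.25, 0.5, 0.75]
```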