Outcome-based Exploration for LLM Reasoning

Yuda Song, Julia Kempe, Remi Munos

TL;DR

The paper addresses the paradox that reinforcement learning (RL) post-training improves LLM reasoning accuracy but reduces generation diversity, hindering real-world deployment. It reframes RL as a sampling process on the training data and introduces outcome-based exploration, including Historical UCB and Batch exploration, along with a theoretical outcome-based bandit model, to preserve diversity while boosting correctness. Empirical results on standard math benchmarks with Llama and Qwen show that these methods improve accuracy and mitigate diversity collapse, achieving a better balance between exploitation and diversity. This work provides a practical path for RL-based reasoning that maintains test-time diversity essential for scalable deployment.

Abstract

Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
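The two exploration ideas named in the abstract can be illustrated with a minimal sketch. The abstract only says that historical exploration uses UCB-style bonuses for rarely observed answers and that batch exploration penalizes within-batch repetition; the specific `1/sqrt(n + 1)` bonus, the repetition-fraction penalty, the `scale` parameters, and all names below are assumptions made for illustration, not the authors' formulas.

```python
import math
from collections import Counter, defaultdict


class OutcomeCounter:
    """Tracks how often each final answer has been sampled for each question."""

    def __init__(self):
        # question_id -> Counter mapping final answer -> historical count
        self.counts = defaultdict(Counter)

    def historical_bonus(self, question_id, answer, scale=1.0):
        # UCB-style bonus (assumed form): large for answers seen rarely,
        # decaying as the same answer keeps being sampled.
        n = self.counts[question_id][answer]
        return scale / math.sqrt(n + 1)

    def update(self, question_id, answers):
        # Record the final answers sampled in the latest batch.
        self.counts[question_id].update(answers)


def batch_penalty(answers, scale=1.0):
    """Penalty per sample (assumed form): the fraction of the other samples
    in the same batch that share its final answer, negated and scaled."""
    n = len(answers)
    if n <= 1:
        return [0.0] * n
    batch_counts = Counter(answers)
    return [-scale * (batch_counts[a] - 1) / (n - 1) for a in answers]
```

In this sketch, either term would be added to the correctness reward of each sampled response before the policy update. The historical bonus depends on counts accumulated across training, while the batch penalty depends only on the current group of samples, which is why the two are kept as separate functions here.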

Paper Structure

This paper contains 38 sections, 10 theorems, 35 equations, 11 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

For any algorithm, there exists an outcome-partitioned bandit instance with $K$ arms and $m$ outcomes such that the expected regret after $T$ rounds is at least $\Omega\left(\min\{T, K\}\right)$.
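Taking the statement at face value, the bound can be split into two regimes depending on whether the horizon $T$ exceeds the number of arms $K$. This is only a rewriting of the $\min$ in the theorem, not an additional result:

```latex
\Omega\left(\min\{T, K\}\right) =
\begin{cases}
\Omega(T), & T \le K, \\
\Omega(K), & T > K.
\end{cases}
```

The number of outcomes $m$ does not appear in the bound as stated: it is governed by the number of arms $K$ even when $m$ is much smaller than $K$.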

Figures (11)

  • Figure 1: Test performance comparison (averaged across $\mathtt{MATH\text{-}500}$, $\mathtt{AIME2024/2025}$, $\mathtt{AMC23}$) between our exploration methods ($\mathtt{UCB\text{-}Con}$ and $\mathtt{Batch}$) and the $\mathtt{GRPO}$ baseline, with $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$ on the easy dataset (left) and $\mathtt{Qwen\text{-}2.5\text{-}7B\text{-}Base}$ on the medium dataset (right). We report pass@$k$ for $k \in \{1,2,4,8,16,32\}$ on an early checkpoint (at timestep 100) and the final checkpoint (at timestep 700). We repeat each experiment with 3 different random seeds and plot the mean performance. The exploration methods outperform the baseline on nearly all metrics throughout training. The one exception is $\mathtt{Qwen\text{-}2.5\text{-}7B\text{-}Base}$ with $\mathtt{UCB\text{-}Con}$, whose pass@1 on the early checkpoint is lower due to exploration, although its pass@32 is much higher. The exploration methods also achieve a better exploitation-exploration trade-off and mitigate overoptimization (note that the last checkpoint of Llama 3.1 8B with vanilla RL performs worse overall than its early checkpoint due to overoptimization).
  • Figure 2: Comparison between RL training dynamics and base model sampling, on both easy and medium difficulty datasets, with $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$ and $\mathtt{Qwen\text{-}2.5\text{-}7B\text{-}Base}$. Top row: number of questions solved so far; bottom row: number of different answers sampled so far. The bottom x-ticks show the number of training epochs $t$, and the top x-ticks show the corresponding number of samples $k$ drawn from the base model, where $k = n t$ and $n$ is the number of samples per epoch. We use $n=16$ for the pass@$k$ comparison and $n=8$ for the diff@$k$ comparison. In the diff@$k$ comparison, solid lines denote the number of different answers averaged over all questions, and dashed lines denote the number of different answers averaged over unsolved questions (i.e., questions for which all answers sampled so far are wrong). The fact that RL has lower diff@$k$ on unsolved questions than the base model indicates the transfer of diversity degradation.
  • Figure 3: Training performance comparison between different $\mathtt{UCB}$ variants and the $\mathtt{GRPO}$ baseline, with $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$ on the easy dataset (left) and $\mathtt{Qwen\text{-}2.5\text{-}7B\text{-}Base}$ on the medium dataset (right). For each subplot: Left: fraction of questions solved so far; Right: number of different answers sampled on the questions that the model has yet to solve (i.e., questions for which no correct answer has been sampled so far). The x-axis denotes the number of gradient updates, as we train all models fully on-policy. We repeat each experiment with 3 different random seeds and plot the mean performance.
  • Figure 4: Test performance comparison between different $\mathtt{UCB}$ variants and the $\mathtt{GRPO}$ baseline, with $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$ on the easy dataset (top) and $\mathtt{Qwen\text{-}2.5\text{-}7B\text{-}Base}$ on the medium dataset (bottom). We report pass@$k$ for $k \in \{1,2,4,8,16,32\}$ at every 20 training steps. We repeat each experiment with 3 different random seeds and plot the mean performance (see the quantitative results section for error bars). The metrics are calculated based on 32 samples per question during evaluation.
  • Figure 5: Training performance comparison between $\mathtt{Batch}$ and $\mathtt{UCB\text{-}Con}$, with $\mathtt{Llama\text{-}3.1\text{-}8B\text{-}Instruct}$ on the easy dataset (left) and $\mathtt{Qwen\text{-}2.5\text{-}7B\text{-}Base}$ on the medium dataset (right). For each subplot: Left: fraction of questions solved so far; Right: number of different answers sampled on the questions that the model has yet to solve (i.e., questions for which no correct answer has been sampled so far). The x-axis denotes the number of gradient updates, as we train all models fully on-policy. We repeat each experiment with 3 different random seeds and plot the mean performance.
  • ...and 6 more figures
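The captions above report pass@$k$ and diff@$k$ computed from a fixed number of samples per question. A minimal sketch of how such metrics are commonly computed follows. It assumes the standard unbiased pass@$k$ estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct, and reads diff@$k$ as the number of distinct final answers among the first $k$ samples. The captions do not say which estimator the authors use, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # One minus the probability that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def diff_at_k(answers, k):
    """Number of distinct final answers among the first k sampled answers."""
    return len(set(answers[:k]))
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` is `0.5`, since 3 of the 6 possible pairs contain the correct sample.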

Theorems & Definitions (11)

  • Theorem 1: Informal version of the theorem labeled thm:lb-no-gen-minimax
  • Theorem 2: Informal version of the theorem labeled thm:pa-ucb
  • Theorem 3: Lower bound for outcome-based bandit
  • Theorem 4: Upper bound under the assumption labeled ass:balanced
  • Theorem 5: Regret upper bound under strong generalization
  • Theorem 6: Upper bound under soft generalization
  • Lemma 1: First hit of the optimal outcome
  • Remark 1
  • Lemma 2
  • Lemma 3: Coupon-collector coupling [motwani1996randomized]
  • ...and 1 more