Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?

Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, Xiangyang Ji

TL;DR

This work introduces Model Predictive Prompt Selection (MoPPS), a Bayesian framework that predicts prompt difficulty online, without costly LLM inference, to accelerate RL finetuning of reasoning models. MoPPS models each prompt's success rate with a Beta prior and updates the posterior recursively, which lets it use Thompson sampling to select prompts of intermediate difficulty and significantly reduce the number of LLM rollouts. The approach is algorithm-agnostic and compatible with PPO, GRPO, and Reinforce++. It delivers up to 1.8x training speedups and substantial performance gains across mathematics, planning, and geometry benchmarks, matching or beating evaluation-heavy baselines with far fewer queries. The results indicate that MoPPS can improve sample efficiency in large-scale reasoning model training and applies to diverse RL finetuning pipelines.

Abstract

Recent advances have demonstrated the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs from frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce the number of iteration steps by prioritizing informative prompts during training. However, their reliance on exhaustive prompt evaluation followed by subset selection still incurs substantial computational overhead because of frequent LLM inference calls. In contrast to these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that estimates prompt difficulty online without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample-efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly fewer LLM rollouts. Our code is available at https://github.com/thu-rllab/MoPPS.
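The selection loop described above (a Beta posterior per prompt, posterior sampling, and a preference for prompts near a target success rate) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `BetaPromptSelector`, the uniform `Beta(1, 1)` prior, the target of 0.5, and the plain conjugate update without any temporal discounting are all assumptions made for illustration. The official code is in the repository linked above.

```python
import random


class BetaPromptSelector:
    """Hypothetical Beta-Bernoulli surrogate over per-prompt success rates."""

    def __init__(self, num_prompts, alpha0=1.0, beta0=1.0, target=0.5, seed=0):
        # One (alpha, beta) pair per prompt; Beta(1, 1) is an assumed uniform prior.
        self.alpha = [alpha0] * num_prompts
        self.beta = [beta0] * num_prompts
        self.target = target  # assumed target success rate (gamma*)
        self.rng = random.Random(seed)

    def posterior_mean(self, i):
        """Posterior mean alpha / (alpha + beta) of prompt i's success rate."""
        return self.alpha[i] / (self.alpha[i] + self.beta[i])

    def select(self, candidates, batch_size):
        """Thompson sampling: draw one success rate per candidate from its
        posterior and keep the prompts whose draws lie closest to the target.
        No LLM call is needed for this step."""
        sampled = {
            i: self.rng.betavariate(self.alpha[i], self.beta[i])
            for i in candidates
        }
        ranked = sorted(candidates, key=lambda i: abs(sampled[i] - self.target))
        return ranked[:batch_size]

    def update(self, i, rewards):
        """Conjugate update from the binary rewards of the k responses that
        were generated for prompt i during training."""
        successes = sum(rewards)
        self.alpha[i] += successes
        self.beta[i] += len(rewards) - successes


if __name__ == "__main__":
    selector = BetaPromptSelector(num_prompts=4)
    # Suppose prompt 0 was trained on and 6 of its 8 responses were correct.
    selector.update(0, [1, 1, 1, 1, 1, 1, 0, 0])
    print(selector.posterior_mean(0))  # (1 + 6) / (1 + 6 + 1 + 2) = 0.7
    print(selector.select(candidates=[0, 1, 2, 3], batch_size=2))
```

In this sketch the posterior is updated only from rewards of prompts that were actually used for training, so prediction for the remaining candidates costs no additional rollouts.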

Paper Structure

This paper contains 71 sections, 2 theorems, 41 equations, 14 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

Define the posterior mean estimate at step $t$ as $\bar{\gamma}^{t}_\tau := \frac{\alpha^{t'}_\tau}{\alpha^{t'}_\tau + \beta^{t'}_\tau}$, and assume the true success rate drifts slowly, i.e., $|\gamma^{t}_\tau - \gamma^{t-1}_\tau| \le \delta,\ \forall{t}$. Then, with probability at least $1 - 2\exp(

Figures (14)

  • Figure 1: Spearman rank correlation and $p$-value over training steps between the predicted prompt difficulty from our Bayesian surrogate and the empirical success rate. The strong correlation indicates that our method effectively predicts prompt difficulty without incurring costly LLM inferences.
  • Figure 2: Performance and computational efficiency of different prompt selection methods on Countdown. MoPPS surpasses uniform selection in both training efficiency and performance, while requiring 79% fewer rollouts than DS [yu2025dapo].
  • Figure 3: Probabilistic graphical model for RL finetuning of LLMs. The reward signal $\bm{r}^t_{\tau_{t,i}}$ is a set of binary values evaluating the $k$ generated responses, governed by the latent success rate $\gamma^t_{\tau_{t,i}}$. The prompt batch $\{\tau_{t,i}\}_{i=1}^{\mathcal{B}}$ is selected under specific criteria based on the current LLM $\bm\theta_t$. The white and grey nodes denote observed and latent variables, respectively.
  • Figure 4: Framework Overview. Left: Comparison between Dynamic Sampling (Oracle), which filters prompts based on actual LLM evaluation on candidates, and our Model Predictive Prompt Selection (MoPPS), which predicts success rates to avoid extra inference cost. Right: MoPPS predicts success rates for candidates from posterior parameters, based on which prompts closest to a target $\gamma^*$ are selected for training; the posterior is then updated using new feedback.
  • Figure 5: Training curves of MoPPS and baseline methods across three reasoning tasks with varying backbone sizes. Notably, DS serves as an oracle baseline, as it relies on expensive exact LLM evaluations and demands significantly more rollouts.
  • ...and 9 more figures
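
The rank-correlation check reported in Figure 1 can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code: the helper names and the sample numbers are hypothetical, and the sketch computes only the Spearman coefficient, not the $p$-value shown in the figure.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    Raises ZeroDivisionError if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    predicted = [0.10, 0.35, 0.50, 0.80]  # surrogate's predicted success rates
    empirical = [0.00, 0.25, 0.625, 0.75]  # success rates measured from rollouts
    print(spearman(predicted, empirical))  # identical ordering gives 1.0
```

A coefficient near 1 means the surrogate orders prompts by difficulty in the same way as the measured success rates, which is what prompt selection relies on.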

Theorems & Definitions (4)

  • Definition 3.1: Prompt Selection Bernoulli Bandit
  • Theorem 3.1: Bounded Success Rate Estimation Error
  • Theorem B.1: Bounded Success Rate Estimation Error
  • proof