Prompt Replay: Speeding up GRPO with On-Policy Reuse of High-Signal Prompts

Andrei Baroian, Rutger Berger

Abstract

Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the reasoning capabilities of LLMs, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories) in order to preserve on-policy optimization. After each step, we insert prompts of medium difficulty into a buffer and prioritize prompts closer to a pass rate of 0.5 (half of the answers correct, half wrong) to maximize the advantage and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with the number of cooldown steps and the maximum reuse count controlling the trade-off between aggressiveness and the risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces the number of zero-variance prompts, increases the mean absolute advantage, and shows faster initial accuracy gains. However, it then plateaus and converges with the baseline, because the configuration used was too aggressive. The method is most efficient when rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, which is a warning against using it as the sole testbed for GRPO method research.
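The buffer mechanics described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class and function names, the medium-difficulty thresholds (`low`, `high`), the `replay_fraction` parameter, and all default values are assumptions introduced here. Only the overall logic (insert medium-difficulty prompts, prioritize pass rates near 0.5, enforce a cooldown and a maximum reuse count, mix with fresh prompts) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    prompt: str
    pass_rate: float      # fraction of correct rollouts the last time it was sampled
    last_step: int        # training step at which the prompt was last used
    reuse_count: int = 0  # number of times the prompt has been replayed so far


class PromptReplayBuffer:
    """Hypothetical buffer; thresholds and defaults are illustrative assumptions."""

    def __init__(self, low=0.25, high=0.75, cooldown=5, max_reuse=15):
        self.low, self.high = low, high
        self.cooldown, self.max_reuse = cooldown, max_reuse
        self.entries = {}

    def update(self, prompt, pass_rate, step):
        """Record the newest pass rate; keep only medium-difficulty prompts."""
        old = self.entries.pop(prompt, None)
        reuse = old.reuse_count if old else 0
        if self.low <= pass_rate <= self.high and reuse < self.max_reuse:
            self.entries[prompt] = BufferEntry(prompt, pass_rate, step, reuse)

    def sample(self, k, step):
        """Return up to k eligible prompts, those closest to pass rate 0.5 first."""
        eligible = [
            e for e in self.entries.values()
            if step - e.last_step >= self.cooldown and e.reuse_count < self.max_reuse
        ]
        eligible.sort(key=lambda e: abs(e.pass_rate - 0.5))
        chosen = eligible[:k]
        for e in chosen:
            e.reuse_count += 1
            e.last_step = step
        return [e.prompt for e in chosen]


def build_batch(buffer, fresh_prompts, batch_size, replay_fraction, step):
    """Mix replayed prompts with fresh ones; fresh prompts fill any shortfall."""
    n_replay = int(batch_size * replay_fraction)
    replayed = buffer.sample(n_replay, step)
    n_fresh = batch_size - len(replayed)
    return replayed + fresh_prompts[:n_fresh]
```

Because only the prompt text is stored, every replayed prompt is rolled out again with the current policy, which is what keeps the update on-policy. After those new rollouts, `update` would be called again with the fresh pass rate, so a prompt that has become too easy or too hard leaves the buffer.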

Paper Structure (22 sections, 12 equations, 5 figures, 1 table, 1 algorithm)


Figures (5)

  • Figure 1: Prompt Replay visualization. In each step, we insert prompts with high learnability into a buffer. Training batches are formed by mixing reused prompts with fresh samples, with the number of cooldown steps and the maximum reuse count controlling the trade-off between aggressiveness and the risk of overfitting.
  • Figure 2: Prompt Replay vs. baseline (OLMo-RL). Prompt Replay shows: a higher mean $|\mathrm{Adv}|$ (3rd row), meaning more signal is extracted from the data; fewer prompts with pass rate = 0 (2nd row), so less compute is wasted on unusable prompts; and earlier gains in average accuracy over 6 benchmarks (1st row), after which it plateaus and converges with the baseline. Benchmark scores are shown with a maximum of 10 rather than 100.
  • Figure 3: Average accuracy over 6 benchmarks over time for the cooldown-step ablation on Qwen2.5 1.5B with the Dolci dataset. All cooldown settings perform similarly. The baseline was trained for 800 steps; each prompt was reused at most 15 times.
  • Figure 4: Average accuracy over 6 benchmarks for the OLMo-RL baseline with Qwen2.5 1.5B on Dolci, training on the full dataset vs. on 32 prompts.
  • Figure 5: Training dynamics for the main results: (row 1) steps vs. time, (row 2) sequence length, (row 3) verifiable reward, (row 4) policy entropy. Columns correspond to Llama-3.2-3B (Dolci), Qwen3-8B (Dolci), and Qwen3-8B (Polaris).
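The two diagnostics in Figure 2, mean $|\mathrm{Adv}|$ and the share of prompts that carry no signal, could be computed from per-prompt reward groups roughly as in the sketch below. It assumes binary verifiable rewards and the standard GRPO group normalization (subtract the group mean, divide by the group standard deviation). The function names and the `eps` constant are illustrative assumptions, and the training framework used in the paper may normalize differently. Under these assumptions, a group with pass rate p has mean absolute advantage 2·sqrt(p(1−p)), which is largest at p = 0.5 and zero at p = 0 or p = 1.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's rollouts (assumed GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def batch_signal_stats(reward_groups):
    """Summarize how much learning signal a batch of prompts provides."""
    abs_adv = []
    zero_variance = 0
    pass_rate_zero = 0
    for rewards in reward_groups:
        if len(set(rewards)) == 1:
            zero_variance += 1   # all rollouts agree, so every advantage is 0
        if sum(rewards) == 0:
            pass_rate_zero += 1  # no rollout was correct
        abs_adv.extend(abs(a) for a in group_advantages(rewards))
    n = len(reward_groups)
    return {
        "mean_abs_adv": statistics.fmean(abs_adv),
        "zero_variance_fraction": zero_variance / n,
        "pass_rate_zero_fraction": pass_rate_zero / n,
    }
```

For example, a batch with one prompt at pass rate 0.5 and two zero-variance prompts gets all of its nonzero advantages from the first prompt, which is the situation Prompt Replay tries to make less common.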