Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards

Reinhard Heckel, Mahdi Soltanolkotabi, Christos Thramboulidis

TL;DR

This work investigates asymmetric prompt weighting for reinforcement learning with verifiable rewards (RLVR) in LLM settings. It introduces four weightings (Linear-R, Sqrt-R, Plateau-R, Uniform-R) that upweight hard prompts and maintain nonzero gradient signals even when ρ̂=0, and analyzes them from both binary-reward and surrogate-reward perspectives. A theoretical policy-dynamics framework yields regular-time and effective-time optimal weights, specifically ω(ρ) ∝ 1/√[ρ(1−ρ)] for regular time and ω(ρ) ∝ 1/[ρ√(1−ρ)] for effective time, with GRPO optimal in regular time and Sqrt-R optimal in effective time. Empirically, asymmetric weighting substantially improves from-scratch RL on TinyZero and GSM8K, while in post-SFT RL (MATH, DAPO-math) the choice of weighting makes little difference. The results provide practical guidance on regime-aware weight design and a dynamics-based justification for when asymmetry accelerates convergence in RLVR.
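To make the two weights quoted above concrete, the following minimal Python sketch evaluates them on the grid 1/32, ..., 31/32 and multiplies each by ρ(1−ρ), following the effective-weight convention of the Figure 2 caption. The function names and the grid size M = 32 are illustrative choices rather than anything taken from the paper's code, and the proportionality constants are set to 1.

```python
import math

def regular_time_weight(rho):
    """omega(rho) = 1 / sqrt(rho * (1 - rho)), constant factor set to 1."""
    return 1.0 / math.sqrt(rho * (1.0 - rho))

def effective_time_weight(rho):
    """omega(rho) = 1 / (rho * sqrt(1 - rho)), constant factor set to 1."""
    return 1.0 / (rho * math.sqrt(1.0 - rho))

def effective_weight(omega, rho):
    """Effective gradient weight omega(rho) * rho * (1 - rho)."""
    return omega(rho) * rho * (1.0 - rho)

M = 32  # assumed number of rollouts; gives the grid 1/32, ..., 31/32
grid = [k / M for k in range(1, M)]
regular = [effective_weight(regular_time_weight, r) for r in grid]
effective = [effective_weight(effective_time_weight, r) for r in grid]

# Print a few grid points to compare the two shapes.
for r, a, b in zip(grid[::6], regular[::6], effective[::6]):
    print(f"rho={r:.3f}  regular-time={a:.3f}  effective-time={b:.3f}")
```

Under these assumptions the regular-time effective weight reduces to √[ρ(1−ρ)], which is symmetric around ρ = 0.5, while the effective-time one reduces to √(1−ρ), which keeps growing as ρ decreases.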

Abstract

Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms, including GRPO, DAPO, and RLOO, focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downweighting gradients from very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so post-SFT RL, where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
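The vanishing signal on very easy and very hard prompts can be seen in a small sketch of group-normalized advantages for binary rewards. It assumes the commonly used GRPO form (reward minus group mean, divided by group standard deviation). The function name `grpo_advantages` and the use of the population standard deviation are assumptions for illustration, and practical implementations typically add a small epsilon to the denominator instead of the explicit zero check used here.

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages for one prompt (illustrative sketch).

    rewards: list of 0/1 rewards, one per sampled response.
    Returns (r - mean) / std per response, or all zeros when every
    response has the same reward (std == 0, so there is no signal).
    """
    m = len(rewards)
    mean = sum(rewards) / m
    # Population variance; some implementations use the sample variance.
    var = sum((r - mean) ** 2 for r in rewards) / m
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0] * m
    return [(r - mean) / std for r in rewards]

# A prompt the model never solves contributes no gradient signal.
print(grpo_advantages([0, 0, 0, 0]))
# A prompt solved once in four tries gives the correct response a large
# positive advantage and the wrong ones a small negative advantage.
print(grpo_advantages([1, 0, 0, 0]))
```

With an empirical success rate of zero every advantage is zero, which is the case the asymmetric weightings are designed to handle differently.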

Paper Structure (33 sections, 5 theorems, 76 equations, 11 figures, 2 tables)

Key Result

Proposition 1

Suppose the success rate $\rho_t$ of a fixed prompt evolves as in Equation eq:ODE with initialization $\rho_0$. Let $T(\rho_0,\rho_*;\omega)$ be the time required so that $\rho_t=\rho_*$ for target success rate $\rho_*>\rho_0$. The non-negative weight $\omega:[0,1]\rightarrow\mathbb{R}_{\geq 0}$ tha

Figures (11)

  • Figure 1: From-scratch RL: TinyZero. Test accuracy (i.e., fraction of correctly solved problems, Pass@1) during reinforcement learning. Starting from a score of $0.025$, the algorithms based on asymmetric weighting (Linear-R, Plateau-R, and Sqrt-R) climb to about $0.8$, while the symmetric weightings (RLOO, Uniform-R, and GRPO) only reach around $0.74$.
  • Figure 2: Effective weights $\omega_x(\rho)\cdot \rho(1-\rho)$ assigned to gradients, and advantages assigned to correct/wrong responses for the five considered prompt weightings. The range $1/32,\ldots,31/32$ is shown since the number of rollouts $M$ is typically at most $32$.
  • Figure 3: Distribution of the fraction of correct responses out of 16, $\hat{\rho}_x$, for each prompt during a run of Linear-R on TinyZero. At early steps, the distribution is heavily concentrated at low $\hat{\rho}$, reflecting that most prompts are difficult for the base model. Importantly, a substantial number of difficult prompts remain even at later steps (e.g., step 220). Linear-R continues to provide strong gradient signal for these prompts due to its weighting, unlike GRPO.
  • Figure 4: From-scratch RL: GSM8K. Test accuracy (i.e., fraction of correctly solved problems, Pass@1) during reinforcement learning on the GSM8K benchmark for the Llama-3.1-8B (base) model. Asymmetric weightings (Plateau-R, Linear-R, Sqrt-R) outperform GRPO, demonstrating the benefit of up-weighting difficult prompts.
  • Figure 5: Post-SFT RL: Test accuracy (i.e., fraction of correctly solved problems) during reinforcement learning on the MATH dataset (left) and on DAPO math (right). All methods perform similarly well.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Lemma 1
  • proof