Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards

Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Yu Luo, Fuzhen Zhuang, Yikun Ban, Deqing Wang

TL;DR

This work addresses two core issues in RLVR: noisy, heterogeneous rollouts within each group, and short-horizon training that discards rollouts after a single use. It introduces Contextual Rollout Bandits (CBS), a neural scheduler that treats each rollout as a contextual bandit arm and performs both intra-group filtering and global reuse via a replay buffer. The authors connect rollout scheduling to contextual bandits, derive sublinear regret bounds, and report consistent performance and training-efficiency gains across six mathematical reasoning benchmarks and multiple RLVR optimizers. The reported outcome is better data efficiency and final reasoning performance when training large language models with RLVR.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language models. However, existing RLVR methods utilize rollouts in an indiscriminate and short-horizon manner: responses of heterogeneous quality within each prompt are treated uniformly, and historical rollouts are discarded after a single use. This leads to noisy supervision, poor sample efficiency, and suboptimal policy updates. We address these issues by formulating rollout scheduling in RLVR as a contextual bandit problem and proposing a unified neural scheduling framework that adaptively selects high-value rollouts throughout training. Each rollout is treated as an arm whose reward is defined by the induced performance gain between consecutive optimization steps. The resulting scheduler supports both noise-aware intra-group selection and adaptive global reuse of historical rollouts within a single principled framework. We provide theoretical justification by deriving sublinear regret bounds and showing that enlarging the rollout buffer improves the achievable performance upper bound. Experiments on six mathematical reasoning benchmarks demonstrate consistent gains in performance and training efficiency across multiple RLVR optimization methods.
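
The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the linear scorer stands in for the neural scheduler, epsilon-greedy stands in for whatever exploration rule the authors use, and the names `Rollout`, `LinearScheduler`, `simulated_gain`, and `run`, along with the two-dimensional feature vector, are hypothetical. Two simplifying assumptions are made: fresh and buffered rollouts compete in one candidate pool, and the observed gain of a step is credited equally to every selected rollout.

```python
import random
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Rollout:
    """A sampled response with hypothetical context features."""
    features: List[float]


@dataclass
class LinearScheduler:
    """Toy linear stand-in for a neural rollout scheduler (assumption)."""
    dim: int
    lr: float = 0.1
    epsilon: float = 0.1
    weights: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.weights:
            self.weights = [0.0] * self.dim

    def score(self, rollout: Rollout) -> float:
        """Predicted performance gain of training on this rollout."""
        return sum(w * x for w, x in zip(self.weights, rollout.features))

    def select(self, candidates: Sequence[Rollout], k: int,
               rng: random.Random) -> List[int]:
        """Return k arm indices: uniform exploration with probability
        epsilon, otherwise the k highest-scoring candidates."""
        k = min(k, len(candidates))
        if rng.random() < self.epsilon:
            return rng.sample(range(len(candidates)), k)
        ranked = sorted(range(len(candidates)),
                        key=lambda i: self.score(candidates[i]),
                        reverse=True)
        return ranked[:k]

    def update(self, chosen: Sequence[Rollout], gain: float) -> None:
        """One SGD step per chosen rollout, moving its score toward the
        gain observed for the whole step (shared-credit assumption)."""
        for r in chosen:
            err = gain - self.score(r)
            self.weights = [w + self.lr * err * x
                            for w, x in zip(self.weights, r.features)]


def simulated_gain(chosen: Sequence[Rollout]) -> float:
    """Placeholder for the performance gain between consecutive
    optimization steps; here the first feature is the hidden usefulness."""
    return sum(r.features[0] for r in chosen) / len(chosen)


def run(steps: int = 200, group_size: int = 8, k: int = 4,
        buffer_cap: int = 32, seed: int = 0
        ) -> Tuple[LinearScheduler, List[Rollout]]:
    rng = random.Random(seed)
    sched = LinearScheduler(dim=2)
    buffer: List[Rollout] = []
    for _ in range(steps):
        fresh = [Rollout([rng.random(), rng.random()])
                 for _ in range(group_size)]
        candidates = fresh + buffer          # new group plus history
        idx = sched.select(candidates, k, rng)
        chosen = [candidates[i] for i in idx]
        # In real RLVR the policy update on `chosen` would happen here.
        sched.update(chosen, simulated_gain(chosen))
        buffer = (buffer + fresh)[-buffer_cap:]  # keep most recent rollouts
    return sched, buffer
```

Calling `run()` returns the trained scheduler and the final buffer. In an actual training loop, `simulated_gain` would be replaced by a measured change in a performance metric before and after the policy update, and the features would describe each rollout and the current training state.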

Paper Structure (33 sections, 6 theorems, 49 equations, 12 figures, 5 tables, 1 algorithm)

Key Result

Theorem 5.3

Let $B$ denote the batch size and $G$ denote the group size. There exist constants $C_1, C_2>0$ such that, if $m \ge C_1$ and $L \ge C_2$, then for any $i,j \in [T]$ with $i \ge j$, it holds that

Figures (12)

  • Figure 1: Workflow of CBS, introducing a neural scheduler plugin that augments RLVR training by selectively using rollouts. The scheduler also updates based on feedback.
  • Figure 2: Training dynamics of different RLVR methods.
  • Figure 3: Ablation study results on Qwen3-4B-Base.
  • Figure 4: Comparison of Entropy and average score on the evaluation set for CBS and w/o Entropy.
  • Figure 5: Training dynamics of the average score on the validation set
  • ...and 7 more figures

Theorems & Definitions (11)

  • Definition 3.1: Intra-Group Scheduling Problem
  • Definition 3.2: Global Scheduling Problem
  • Definition 3.3: Performance Gain Reward
  • Theorem 5.3
  • Theorem 5.4
  • Definition 2.1: NTK (DBLP:conf/nips/JacotHG18; DBLP:conf/nips/WangADSG21)
  • Lemma 2.2
  • proof
  • Lemma 2.3: Theorem 3.1 of DBLP:conf/nips/YunSJ19
  • Lemma 2.4: Theorem 2 of DBLP:journals/corr/jmlr24
  • ...and 1 more