Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen

TL;DR

Jackpot introduces Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the distribution gap between rollout and target policies in reinforcement learning for large language models. It couples OBRS with a unified objective that jointly updates the policy and rollout models, augmented by top-$k$ probability estimation and batch-level bias correction to enable efficient, scalable decoupled rollout training. The authors prove that OBRS reduces the KL divergence to the target distribution under a configurable acceptance budget and demonstrate empirically that Jackpot improves training stability versus importance-sampling baselines, achieving on-policy-like performance for Qwen3-8B-Base for up to 300 update steps. This work offers a practical direction for decoupled rollout generation in RL for LLMs, potentially lowering training cost while preserving stability and performance.

Abstract

Reinforcement learning (RL) for large language models (LLMs) remains expensive, particularly because the rollout is expensive. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model for rollouts) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.

Paper Structure

This paper contains 44 sections, 5 theorems, 35 equations, 6 figures, 10 tables, 2 algorithms.

Key Result

Theorem 3.3

Let $\boldsymbol{p}$ be the target distribution and $\boldsymbol{q}$ be the proposal distribution. For any $\lambda > 0$, consider the OBRS acceptance rule with parameter $\lambda$. Then the post-rejection distribution $\tilde{\boldsymbol{q}}$ is strictly closer to $\boldsymbol{p}$ in KL divergence than the original proposal $\boldsymbol{q}$ whenever $\lambda < \max_i \frac{p_i}{q_i}$.
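
The theorem's displayed equations are not shown in this summary. As a hedged sketch only, suppose the acceptance rule takes the standard OBRS form from the budgeted rejection sampling literature. This is an assumption, not a quotation of Theorem 3.3, and the symbols $a_i$ and $Z_\lambda$ are introduced here for illustration. Under that assumption, the rule, the post-rejection distribution, and the resulting change in forward KL divergence can be written as:

```latex
% Assumed form (standard OBRS rule; not quoted from the paper)
a_i = \min\!\left(1, \frac{p_i}{\lambda q_i}\right), \qquad
Z_\lambda = \sum_j q_j a_j = \sum_j \min\!\left(q_j, \frac{p_j}{\lambda}\right), \qquad
\tilde{q}_i = \frac{q_i a_i}{Z_\lambda}.

% Direct consequence of the definitions above
D_{\mathrm{KL}}(\boldsymbol{p} \,\|\, \boldsymbol{q}) - D_{\mathrm{KL}}(\boldsymbol{p} \,\|\, \tilde{\boldsymbol{q}})
  = \sum_i p_i \log \frac{\tilde{q}_i}{q_i}
  = \sum_i p_i \log a_i - \log Z_\lambda .
```

In this assumed form, $Z_\lambda$ is the overall acceptance rate, so $\lambda$ acts as the budget knob: a larger $\lambda$ rejects more samples, and for $\lambda \ge \max_i \frac{p_i}{q_i}$ the rule reduces to classical rejection sampling with $\tilde{\boldsymbol{q}} = \boldsymbol{p}$.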

Figures (6)

  • Figure 1: (a) Off-policy RL training is desirable because it alleviates the high cost of the rollout stage in on-policy RL. The strong but inference-costly policy model can be approximated by a cheaper, weaker counterpart to speed up the rollout stage. However, the trajectories rolled out by the actor model must be aligned in probability distribution with the policy model for off-policy training to be viable. (b) An extreme off-policy case in which a Qwen3-1.7B-Base model generates the rollouts used to train a Qwen3-8B-Base policy. Without any alignment procedure, training collapses (pink). The prior method TIS (green) also shows a significant gap to the Qwen3-8B-Base on-policy baseline (purple), and as training collapses its KL divergence also increases sharply. Our proposed method, Jackpot, provides much more stable training in this setting.
  • Figure 2: Panels (a) and (b) report numerical experiments in which the LLMs' output distributions are simulated with randomly generated Dirichlet distributions, with controllable noise used to attain different levels of KL divergence. (a) Simulated Jackpot acceptance rate across different pairs of actor-policy distributions. Although the acceptance rate slowly decreases as the distributions move further apart, it remains high ($>90\%$) across the whole range. For reference, the actor-policy gaps of several commonly seen off-policy settings are also marked. (b) In our simulations, Jackpot significantly shrinks the KL divergence between the target distribution and the applied distribution, sometimes by an order of magnitude. (c) Jackpot (yellow) maintains a small KL divergence between the actor and policy model probability distributions, whereas both the no-alignment baseline and TIS see the KL divergence rise explosively as training continues.
  • Figure 3: Illustration of the Jackpot pipeline, focusing on the Optimal Budgeted Rejection Sampling (OBRS) and reweighting procedures.
  • Figure 4: Jackpot enables probability distribution alignment beyond existing methods. In the extreme two-model joint training setting, Jackpot allows the smaller, weaker model to roll out trajectories that the bigger, stronger model then uses for its training. Prior TIS methods, even with KL added, consistently suffer from unstable training across three settings: Qwen2.5 1.5B and 3B, Qwen3 1.7B and 4B, and Qwen3 1.7B and 8B base models. In contrast, Jackpot achieves performance comparable to that of the large model trained on-policy.
  • Figure 5: (a) Jackpot enables the removal of clipping from stale RL training. (b) Jackpot neither helps nor harms when the actor-policy distributions are relatively close and can be sufficiently corrected by TIS. (c) When KL is removed, Jackpot consistently sustains training longer than its TIS counterparts.
  • ...and 1 more figure
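
The simulation described in Figure 2 (a) and (b) can be sketched roughly as follows. This is a minimal illustration, not the authors' experiment: the acceptance rule `min(1, p_i / (lam * q_i))` is the assumed standard OBRS form, and the vocabulary size, the Dirichlet concentration, the noise model (mixing the target with an independent Dirichlet draw), and the names `make_pair` and `obrs` are all hypothetical choices.

```python
import numpy as np


def kl(p, q):
    """Forward KL divergence KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))


def make_pair(noise, vocab=1000, seed=0):
    """Hypothetical noise model: q is p mixed with an independent Dirichlet draw."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(vocab)) + 1e-12  # tiny floor keeps every entry positive
    p /= p.sum()
    other = rng.dirichlet(np.ones(vocab))
    q = (1.0 - noise) * p + noise * other
    return p, q / q.sum()


def obrs(p, q, lam=1.0):
    """Assumed OBRS rule: keep token i with probability min(1, p_i / (lam * q_i))."""
    accept = np.minimum(1.0, p / (lam * q))
    kept = q * accept          # unnormalized post-rejection mass
    rate = float(kept.sum())   # overall acceptance rate
    return rate, kept / rate   # rate and normalized post-rejection distribution


if __name__ == "__main__":
    for noise in (0.1, 0.5, 0.9):
        p, q = make_pair(noise)
        rate, q_tilde = obrs(p, q)
        print(f"noise={noise:.1f}  accept={rate:.3f}  "
              f"KL before={kl(p, q):.4f}  KL after={kl(p, q_tilde):.4f}")
```

With `lam=1.0` the acceptance rate in this sketch equals one minus the total variation distance between `p` and `q`, which gives a quick sanity check on the output.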

Theorems & Definitions (10)

  • Definition 3.1: Rejection Sampling (RS)
  • Definition 3.2: Optimal Budgeted Rejection Sampling (OBRS) (Verine et al., 2024)
  • Theorem 3.3: OBRS Improves Distribution Alignment
  • Theorem 3.4: Optimality of OBRS under a Fixed Acceptance Budget
  • Theorem A.1: Budgeted Optimal Acceptance
  • proof
  • Proposition A.2: Monotonic KL Contraction
  • proof
  • Corollary A.3: Strict Improvement over Proposal
  • proof
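
Theorems 3.4 and A.1 concern a fixed acceptance budget; their statements are not reproduced in this summary. As a hedged sketch, assuming the acceptance rule `min(1, p_i / (lam * q_i))`, the acceptance rate is `sum_i min(q_i, p_i / lam)`, which is non-increasing in `lam`, so a value of `lam` that meets a given budget can be found by bisection. The function names and the bisection approach are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def acceptance_rate(p, q, lam):
    """Acceptance rate under the assumed rule: sum_i min(q_i, p_i / lam)."""
    return float(np.minimum(q, p / lam).sum())


def solve_lambda(p, q, budget, iters=200):
    """Bisection for the lam whose acceptance rate equals `budget` (0 < budget <= 1)."""
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must lie in (0, 1]")
    ratios = p / q
    lo = float(ratios.min())                     # rate is 1 for lam <= min ratio
    hi = max(float(ratios.max()), 1.0 / budget)  # rate is 1 / lam <= budget here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if acceptance_rate(p, q, mid) > budget:
            lo = mid  # accepting too much: increase lam
        else:
            hi = mid  # within budget: try a smaller lam
    return hi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(50))
    q = rng.dirichlet(np.ones(50))
    for budget in (0.5, 0.8, 0.95):
        lam = solve_lambda(p, q, budget)
        print(f"budget={budget:.2f}  lam={lam:.4f}  "
              f"rate={acceptance_rate(p, q, lam):.4f}")
```

A tighter budget (a lower target acceptance rate) yields a larger `lam` in this sketch, which moves the rule closer to classical rejection sampling.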