Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning
Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen
TL;DR
Jackpot uses Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the distribution gap between the rollout model and the target policy in reinforcement learning for large language models. It couples OBRS with a unified objective that jointly updates the policy and rollout models, and relies on top-$k$ probability estimation and batch-level bias correction to make decoupled rollout training efficient and scalable. The authors prove that OBRS reduces the KL divergence to the target distribution under a configurable acceptance budget, and show empirically that Jackpot improves training stability over importance-sampling baselines, reaching performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps. The work offers a practical direction for decoupled rollout generation in RL for LLMs, potentially lowering training cost while preserving stability and performance.
Abstract
Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation is costly. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to roll out) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with a batch size of 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization in RL for LLMs.
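To make the central claim concrete, here is a minimal sketch of budgeted rejection sampling on a toy three-token vocabulary. It assumes the usual budgeted acceptance rule, in which a token $x$ drawn from the rollout distribution $q$ is kept with probability $\min(1, p(x) / (\lambda q(x)))$ for target $p$ and a budget parameter $\lambda$. The abstract does not state the exact rule or how Jackpot sets the budget, so the function names, the value of `lam`, and the toy distributions below are illustrative assumptions, not the authors' implementation.

```python
import math

def obrs_accept_probs(p, q, lam):
    """Assumed acceptance rule: keep token i with probability min(1, p_i / (lam * q_i))."""
    return [min(1.0, pi / (lam * qi)) if qi > 0 else 0.0 for pi, qi in zip(p, q)]

def obrs_distribution(p, q, lam):
    """Distribution of accepted tokens and the overall acceptance rate."""
    accept = obrs_accept_probs(p, q, lam)
    unnorm = [qi * ai for qi, ai in zip(q, accept)]  # equals min(q_i, p_i / lam)
    rate = sum(unnorm)                               # fraction of samples kept
    return [u / rate for u in unnorm], rate

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy distributions over a three-token vocabulary.
p = [0.7, 0.2, 0.1]  # target policy
q = [0.4, 0.3, 0.3]  # rollout model
q_adj, accept_rate = obrs_distribution(p, q, lam=1.0)

print("acceptance rate:", accept_rate)
print("KL(p || q)     :", kl(p, q))
print("KL(p || q_adj) :", kl(p, q_adj))
```

In this toy case the accepted-token distribution is closer to the target than the raw rollout distribution, while most samples are still kept. Under the assumed rule, raising `lam` accepts fewer samples and pulls the accepted distribution further toward the target, which is the trade-off the acceptance budget controls.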
