OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei, Qing Li

TL;DR

OptPO addresses the inefficiency of fixed-budget rollout in test-time policy optimization for LLMs facing distribution shifts. It reframes reward estimation as a Bayesian sequential probability ratio test, enabling adaptive rollout stopping and reuse of rollouts for on-policy updates with PPO/GRPO or SFT. The approach achieves substantial compute savings (often 30-50% token reductions) while preserving or improving accuracy across diverse math and science benchmarks and backbone models. This work provides a principled, plug-and-play framework that unifies test-time learning with reinforcement learning under an efficient, label-free paradigm.

Abstract

Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be released upon acceptance at https://open-upon-acceptance.
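
The adaptive stopping idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact test: it substitutes a simple Beta-Binomial posterior over the top two answers for the Bayesian sequential probability ratio test, and the names `leader_confidence`, `adaptive_vote`, `sample_answer`, `threshold`, `min_rollouts`, and `max_rollouts` are hypothetical, chosen only for illustration. Sampling stops once the posterior probability that the leading answer out-votes the runner-up reaches the threshold, or when the rollout budget runs out.

```python
from collections import Counter
from math import comb
from typing import Callable, List, Tuple


def leader_confidence(a: int, b: int) -> float:
    """Posterior P(p > 0.5) for the leader's head-to-head vote share p.

    Assumes a uniform Beta(1, 1) prior, so the posterior is Beta(a + 1, b + 1).
    For integer parameters this tail equals P(Binomial(a + b + 1, 0.5) <= a).
    """
    n = a + b
    return sum(comb(n + 1, k) for k in range(a + 1)) / 2 ** (n + 1)


def adaptive_vote(
    sample_answer: Callable[[], str],
    threshold: float = 0.95,   # illustrative confidence level (assumption)
    min_rollouts: int = 2,     # illustrative lower bound (assumption)
    max_rollouts: int = 16,    # illustrative fixed-budget cap (assumption)
) -> Tuple[str, List[str]]:
    """Draw rollouts one at a time and stop once the leader is confident enough."""
    answers: List[str] = []
    for _ in range(max_rollouts):
        answers.append(sample_answer())
        if len(answers) < min_rollouts:
            continue
        ranked = Counter(answers).most_common(2)
        a = ranked[0][1]                            # votes for the leader
        b = ranked[1][1] if len(ranked) > 1 else 0  # votes for the runner-up
        if leader_confidence(a, b) >= threshold:
            break
    consensus = Counter(answers).most_common(1)[0][0]
    return consensus, answers


if __name__ == "__main__":
    # Stand-in for an LLM rollout that always returns the same final answer.
    consensus, rollouts = adaptive_vote(lambda: "42")
    print(consensus, len(rollouts))
```

In this sketch the returned list holds every rollout drawn before stopping, so a caller could reuse those samples for a policy update in the spirit of the abstract; how OptPO actually scores and reuses them is not specified here.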

Paper Structure

This paper contains 33 sections, 16 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Performance (mean@16 accuracy vs rollout time) comparison of OptPO and TTRL on the MATH-500 benchmark, integrating different RL algorithms and using (i) Llama-3.2-1B-Instruct (top row) and (ii) Qwen2.5-7B (bottom row) as backbones.