Think Outside the Policy: In-Context Steered Policy Optimization

Hsiu-Yuan Huang, Chenming Tang, Weijie Liu, Saiyong Yang, Yunfang Wu

TL;DR

This work addresses the limited exploration inherent in Group Relative Policy Optimization (GRPO) by introducing In-Context Steered Policy Optimization (ICPO), which leverages the in-context learning capability of large reasoning models to provide expert guidance from existing datasets. ICPO combines Mixed-Policy GRPO with Implicit Expert Forcing, Expert Region Reject Sampling, and Reward Shaping with an annealed expert bonus to expand exploration, filter unreliable off-policy trajectories, and stabilize training. Empirical results on mathematical reasoning benchmarks show consistent improvements over vanilla GRPO across model scales, with notable gains on expert-domain data and solid generalization to out-of-distribution tasks. The approach eliminates the need for external advanced LRMs, offering a scalable RLVR framework that enhances reasoning performance through data-driven, in-context guidance.

Abstract

Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration because they rely on on-policy rollouts that are confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost, and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates Expert Region Reject Sampling to filter unreliable off-policy trajectories and Annealed Expert-Bonus Reward Shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances reinforcement learning performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs.
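
To make the three components named in the abstract more concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes binary verifiable rewards, a linear decay schedule for the expert bonus, and a reject rule that drops expert-guided rollouts whose answers fail verification; the abstract specifies none of these details. All function names and the `initial_bonus` value are hypothetical. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) is standard GRPO.

```python
from statistics import mean, pstdev


def annealed_bonus(step, total_steps, initial_bonus=0.5):
    """Hypothetical schedule: decay the expert bonus linearly from initial_bonus to 0."""
    if total_steps <= 0:
        return 0.0
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_bonus * (1.0 - frac)


def filter_guided(rewards, is_guided):
    """Assumed reject rule: drop in-context-guided rollouts that were not verified correct."""
    return [(r, g) for r, g in zip(rewards, is_guided) if not (g and r <= 0.0)]


def shape_rewards(pairs, step, total_steps):
    """Add the annealed bonus to correct, expert-guided rollouts only (an assumption)."""
    bonus = annealed_bonus(step, total_steps)
    return [r + bonus if (g and r > 0.0) else r for r, g in pairs]


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO normalization: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One prompt, four rollouts: the first two were generated with an in-context expert example.
    rewards = [1.0, 0.0, 1.0, 0.0]
    is_guided = [True, True, False, False]

    kept = filter_guided(rewards, is_guided)
    shaped = shape_rewards(kept, step=0, total_steps=100)
    print(group_relative_advantages(shaped))
```

Under these assumptions, a correct guided rollout receives a larger advantage early in training, and the gap closes as the bonus anneals toward zero, which is one way to read "early expert guidance with later autonomous improvement".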

Paper Structure

This paper contains 31 sections, 12 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of optimization dynamics in parameter space. GRPO exploration is confined to the current policy’s distribution, limiting trajectory diversity and often leading to suboptimal convergence. While prior methods expand exploration by incorporating expert rollouts generated by stronger LRMs, ICPO leverages existing datasets—beyond the original training data—for mixed-policy GRPO with implicit expert forcing, eliminating reliance on external expert models.
  • Figure 2: Comparison between 0-shot and 1-shot ICL on reasoning accuracy across benchmark datasets.
  • Figure 3: Effect of in-context steering on exploration and diversity. Compared with temperature-based sampling, 1-shot ICL produces trajectories with larger semantic distribution distances (shown as violin plots) and a higher ratio of flipped-correct generations (highlighted red dots), indicating that expert conditioning provides a stronger and more targeted exploration signal.
  • Figure 4: ICPO Overall Framework. ICPO performs mixed-policy GRPO using off-policy trajectories generated by the policy model itself via implicit expert forcing.
  • Figure 5: Reward curves of Qwen3-8B over training steps across test and train sets.
  • ...and 1 more figure