
Provable and Practical In-Context Policy Optimization for Self-Improvement

Tianrun Yu, Yuxiao Yang, Zhaoyang Wang, Kaixiang Zhao, Porter Jenkins, Xuchao Zhang, Chetan Bansal, Huaxiu Yao, Weitong Zhang

TL;DR

We propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm in which a model iteratively refines its response in context at inference time, using its previous responses and self-assessed rewards, and ensures the robustness of those self-assessed rewards via majority voting.

Abstract

We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we theoretically show that with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling in mathematical reasoning.
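The two ingredients named in the abstract, majority-vote self-assessed rewards and minimum-entropy selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`majority_vote_rewards`, `answer_entropy`, `select_min_entropy`), the 0/1 reward, and the use of Shannon entropy over sampled final answers are all assumptions made for illustration, and the sampled answers are plain strings standing in for model outputs.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Assumed self-assessed reward: 1.0 if an answer agrees with the majority, else 0.0."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_min_entropy(candidate_samples):
    """Return the index of the candidate whose sampled answers have the lowest entropy.

    candidate_samples[i] is a hypothetical list of answers sampled after adding
    candidate i (a response and its self-assessed reward) to the context.
    """
    return min(range(len(candidate_samples)),
               key=lambda i: answer_entropy(candidate_samples[i]))


# Stand-in data: three sampled answers for one question.
majority, rewards = majority_vote_rewards(["42", "42", "7"])
# Two hypothetical candidate contexts and the answers sampled under each.
best = select_min_entropy([["42", "7", "13"], ["42", "42", "7"]])
```

Under these assumptions, the second candidate is preferred because its sampled answers agree more often, so the vote that produces the reward is less noisy.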

Paper Structure (60 sections, 21 theorems, 155 equations, 4 figures, 8 tables, 1 algorithm)


Key Result

Theorem 4.1

Assume both teacher and student use $\gamma$-mixture exploration with $\gamma\in(0,1)$ as described in equation eq:pol, and let $N$ denote the trajectory length of the sample inside the expectation. Then,

Figures (4)

  • Figure 1: The In-Context Policy Optimization (ICPO) framework. At each round $t$, the agent leverages its history of past attempts with bandit feedback $\{(\mathbf{x}_1, r_1), \dots, (\mathbf{x}_t, r_t)\}$ to improve its next response $\mathbf{x}_{t+1}$ in order to maximize the received reward $r_{t+1}$.
  • Figure 2: Validation of ICPO theory. (Top): Policy Matching. (Bottom): Reward-Shock Stability.
  • Figure 3: Performance comparison of backbone models before and after ME-ICPO.
  • Figure 4: Hyperparameter sensitivity of ME-ICPO on AIME 2024 with Qwen2.5-Math-7B.

Theorems & Definitions (25)

  • Theorem 4.1: Mixed-policy KL is controlled by the Fisher-projected quadratic loss
  • Theorem 4.2: Population Equivalence
  • Theorem 4.3: Finite sample result
  • Remark 4.4
  • Remark 4.5
  • Remark 4.6
  • Definition 4.7: $s$-CRN coupled trajectories
  • Theorem 4.8: Stability to One-step Reward Perturbations
  • Lemma A.1: Mixture curvature and Fisher bounds on ${\bm 1}^\perp$
  • Lemma A.2: Softmax is $1/2$-Lipschitz on ${\bm 1}^\perp$
  • ...and 15 more
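The Lipschitz property named in Lemma A.2 can be checked numerically with a small sketch. It assumes the Euclidean norm (the norm the lemma uses is not shown here) and takes ${\bm 1}^\perp$ to mean vectors whose entries sum to zero. The helper names, dimension, and sampling distribution are illustrative choices, and a random search of this kind can only fail to find a counterexample; it proves nothing.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def project_zero_sum(z):
    """Project z onto the subspace of vectors orthogonal to the all-ones vector."""
    mean = sum(z) / len(z)
    return [v - mean for v in z]


def l2_norm(u):
    return math.sqrt(sum(v * v for v in u))


def max_lipschitz_ratio(dim=5, trials=2000, seed=0):
    """Largest observed ||softmax(u) - softmax(v)|| / ||u - v|| over random zero-sum pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = project_zero_sum([rng.gauss(0.0, 2.0) for _ in range(dim)])
        v = project_zero_sum([rng.gauss(0.0, 2.0) for _ in range(dim)])
        num = l2_norm([a - b for a, b in zip(softmax(u), softmax(v))])
        den = l2_norm([a - b for a, b in zip(u, v)])
        if den > 0.0:
            worst = max(worst, num / den)
    return worst


ratio = max_lipschitz_ratio()
```

The observed ratio should stay at or below $1/2$, consistent with the constant stated in the lemma's title.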