Provable and Practical In-Context Policy Optimization for Self-Improvement

Tianrun Yu; Yuxiao Yang; Zhaoyang Wang; Kaixiang Zhao; Porter Jenkins; Xuchao Zhang; Chetan Bansal; Huaxiu Yao; Weitong Zhang

TL;DR

The paper proposes Minimum-Entropy ICPO (ME-ICPO), a practical algorithm in which a model iteratively refines its response in context at inference time, using its previous responses and their self-assessed rewards. Majority voting keeps the self-assessed rewards robust.

Abstract

We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards without modifying its parameters. To explain this ICPO process, we show theoretically that, with sufficient pretraining under a novel Fisher-weighted logit-matching objective, a single-layer linear self-attention model can provably imitate a policy-optimization algorithm for linear bandits. Building on this theory, we propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively uses its response and self-assessed reward to refine its response in context at inference time. By selecting the responses and their rewards with minimum entropy, ME-ICPO ensures the robustness of the self-assessed rewards via majority voting. Across standard mathematical reasoning tasks, ME-ICPO attains competitive, top-tier performance while keeping inference costs affordable compared with other inference-time algorithms. Overall, ICPO provides a principled understanding of self-reflection in LLMs and yields practical benefits for test-time scaling in mathematical reasoning.
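The two ingredients named in the abstract, majority-vote rewards and minimum-entropy selection, can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify how candidates are generated or exactly what the entropy is computed over, so the sketch assumes that each candidate is a batch of sampled final answers and that entropy is the Shannon entropy of the empirical answer distribution. The helper names `majority_vote_rewards`, `answer_entropy`, and `select_min_entropy` are hypothetical.

```python
import math
from collections import Counter


def majority_vote_rewards(answers):
    """Self-assessed reward (assumed form): 1.0 if an answer agrees
    with the majority answer of its batch, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def select_min_entropy(batches):
    """Index of the batch whose answers are most concentrated
    (lowest entropy). Assumption: selection happens per batch."""
    return min(range(len(batches)), key=lambda i: answer_entropy(batches[i]))


# Toy data standing in for sampled model answers (hypothetical).
batches = [
    ["42", "41", "42", "7"],   # answers spread over three values
    ["42", "42", "42", "41"],  # answers mostly agree
]
best = select_min_entropy(batches)
majority, rewards = majority_vote_rewards(batches[best])
print(best, majority, rewards)
```

In a full loop, the selected responses and their rewards would be appended to the context before the next round of sampling. That step is omitted here because it depends on the model interface.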

Paper Structure (60 sections, 21 theorems, 155 equations, 4 figures, 8 tables, 1 algorithm)

Introduction
Notation.
Related Work
Test-Time Scaling.
Self-reflection and self-assessment.
In-Context Learning and In-Context Reinforcement Learning.
Preliminaries
Policy Optimization Framework.
Supervised Pretraining Data Generation.
Linear Self-Attention (LSA).
Theoretical Framework for ICPO
The ICPO Forward Pass.
Training Objective.
Theoretical Guarantees for ICPO
Minimum-Entropy In-Context Policy Optimization
...and 45 more sections

Key Result

Theorem 4.1

Assume both teacher and student use $\gamma$-mixture exploration with $\gamma\in(0,1)$ as described in equation eq:pol, and let $N$ denote the trajectory length of the sample inside the expectation. Then,

Figures (4)

Figure 1: The In-Context Policy Optimization (ICPO) framework. At each round $t$, the agent leverages its history of past attempts with bandit feedback $\{(\mathbf{x}_1, r_1), \dots, (\mathbf{x}_t, r_t)\}$ to improve its response $\mathbf{x}_{t+1}$ in order to maximize the received reward $r_t$.
Figure 2: Validation of ICPO theory. (Top): Policy Matching. (Bottom): Reward-Shock Stability.
Figure 3: Performance comparison of backbone models before and after ME-ICPO.
Figure 4: Hyperparameter sensitivity of ME-ICPO on AIME 2024 with Qwen2.5-Math-7B.

Theorems & Definitions (25)

Theorem 4.1: mixed-policy KL is controlled by the Fisher-projected quadratic loss
Theorem 4.2: Population Equivalence
Theorem 4.3: Finite sample result
Remark 4.4
Remark 4.5
Remark 4.6
Definition 4.7: $s$-CRN coupled trajectories
Theorem 4.8: Stability to One-step Reward Perturbations
Lemma A.1: Mixture curvature and Fisher bounds on ${\bm 1}^\perp$
Lemma A.2: Softmax is $1/2$-Lipschitz on ${\bm 1}^\perp$
...and 15 more
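Lemma A.2 is listed here by title only, so the norm it uses is not visible on this page. Assuming the Euclidean norm, the claim can be checked numerically with a short sketch: sample pairs of logit vectors, project them onto ${\bm 1}^\perp$ by subtracting the mean, and compare the distance between their softmax outputs with the distance between the inputs. The function names and sampling parameters below are illustrative choices, not taken from the paper.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def center(z):
    """Project z onto the subspace orthogonal to the all-ones vector."""
    mu = sum(z) / len(z)
    return [v - mu for v in z]


def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def max_lipschitz_ratio(dim=5, trials=2000, seed=0):
    """Largest observed ||softmax(u) - softmax(v)|| / ||u - v||
    over random centered pairs (u, v)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        u = center([rng.gauss(0.0, 2.0) for _ in range(dim)])
        v = center([rng.gauss(0.0, 2.0) for _ in range(dim)])
        d = l2(u, v)
        if d > 0:
            worst = max(worst, l2(softmax(u), softmax(v)) / d)
    return worst


ratio = max_lipschitz_ratio()
print(ratio)
```

The observed ratio should stay at or below $1/2$, consistent with the softmax Jacobian $\mathrm{diag}(p) - pp^\top$ having spectral norm at most $1/2$. A random search of this kind is only a sanity check, not a proof.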
