InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Fanqi Kong; Jiayi Zhang; Mingyi Deng; Chenglin Wu; Yuyu Luo; Bang Liu

TL;DR

InfoPO (Information-Driven Policy Optimization) frames multi-turn interaction as a process of active uncertainty reduction. It computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual.

Abstract

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
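The two ingredients named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the log-ratio form of the info-gain reward, the per-rollout aggregation, the exponential gate, and the names `info_gain_rewards`, `group_relative`, `fused_advantages`, and `tau` are all assumptions made for illustration. The only standard piece is the group-relative normalization used by GRPO, which yields zero advantage whenever every rollout in a group receives the same outcome reward.

```python
import math
from statistics import mean, pstdev


def info_gain_rewards(logp_with_feedback, logp_masked):
    """Per-turn info gain as a log-ratio (assumed form).

    logp_with_feedback[t]: log-prob of the next action given history and feedback.
    logp_masked[t]: log-prob of the same action when the feedback is masked.
    """
    return [lw - lm for lw, lm in zip(logp_with_feedback, logp_masked)]


def group_relative(values, eps=1e-6):
    """GRPO-style normalization within a rollout group: (v - mean) / (std + eps)."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def fused_advantages(outcome_rewards, info_rewards, tau=0.1):
    """Hypothetical variance gate: lean on info gain when outcomes barely vary.

    outcome_rewards, info_rewards: one scalar per rollout in the group
    (info gain summed over turns here for brevity; an assumption).
    """
    a_out = group_relative(outcome_rewards)
    a_info = group_relative(info_rewards)
    # Gate is 1 when outcome rewards are identical and decays as their spread grows.
    gate = math.exp(-pstdev(outcome_rewards) / tau)
    return [ao + gate * ai for ao, ai in zip(a_out, a_info)]


if __name__ == "__main__":
    # Every rollout failed, so outcome advantages are all zero,
    # but the info-gain term still separates the rollouts.
    print(fused_advantages([0.0, 0.0, 0.0], [1.0, 0.0, 2.0]))
```

In this sketch, a group where all rollouts fail still produces a nonzero learning signal through the info-gain term, while a group with widely varying outcomes is driven almost entirely by the outcome advantage.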

Paper Structure (67 sections, 2 theorems, 39 equations, 18 figures, 5 tables, 1 algorithm)

Introduction
Related Works
User-centric agents.
Agentic reinforcement learning.
Reward Shaping in RL.
Preliminaries
Multi-Turn Interaction as a Dec-POMDP
Optimization with Group-Relative Policy Gradient
Methods
Turn-level Counterfactual Information Gain
Unified Group-Relative Advantage Construction
Outcome Advantage.
Info-Gain Advantage.
Adaptive Fusion via Variance Gating.
Theory: Info-Gain as a Necessary Resource
...and 52 more sections

Key Result

Theorem 1

Let $H_t$ be the interaction history, $O_t$ the feedback, and $A_{t+1}$ the subsequent action. Defining the marginal policy $\pi_\theta(\cdot\mid H_t) \triangleq \mathbb{E}_{O_t\sim P(\cdot\mid H_t)} [\pi_\theta(\cdot\mid H_t,O_t)]$, the turn-level info-gain reward $r^{\mathrm{info}}_t$ defined in E

Figures (18)

Figure 1: Standard GRPO vs. InfoPO. Standard GRPO yields zero reward for "honorable failures" (correct elicitation, failed execution). InfoPO solves this via counterfactual masking to provide dense, turn-level information-gain rewards.
Figure 2: Overview of the InfoPO framework. It extracts a turn-level info-gain signal by counterfactual reasoning and adaptively fuses it with outcome-based advantages to facilitate efficient credit assignment in multi-turn user-centric tasks.
Figure 3: Extrinsic reward curves during training on (a) UserGym, (b) ColBench, and (c) $\tau^{2}$-Bench. Solid lines and shaded regions represent mean $\pm$ std across three seeds.
Figure 4: Mechanism diagnostics under ablations (aggregated across tasks). $J_f$ denotes final extrinsic performance; $\Delta_{\mathrm{bf}}$ is the best-to-final drop (late-training instability); $P_{\mathrm{cr}}$ is the probability of training collapse; $\bar{T}$ and $\bar{L}$ are the average interaction turns and response length; and $\rho_{L,r}$ is the correlation between response length and extrinsic reward (a proxy for length-based reward hacking). All metrics are converted to "higher-is-better" scores in the figure; formal definitions are in the task-metrics appendix.
Figure 5: Interaction dynamics and reward signals. (a) Turns vs. response length; (b) Absolute info-gain (solid) and its advantage contribution ratio (dashed). See the results appendix for more results.
...and 13 more figures

Theorems & Definitions (2)

Theorem 1: Equivalence to Mutual Information
Theorem 2: Necessity for Task Success
