IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

Yinhan He; Yaochen Zhu; Mingjia Shi; Wendy Zheng; Lin Su; Xiaoqing Wang; Qi Guo; Jundong Li

TL;DR

This work proposes IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information with the final answer, providing an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration.
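As a reference point, the conditional mutual information named above has a standard entropy-difference form. The symbols are assumptions chosen for illustration, not notation confirmed by the paper: $q$ is the prompt, $o_{<t}$ the reasoning tokens before position $t$, $O_t$ the random token at position $t$ drawn from the policy $\pi$, and $Y$ the final answer.

```latex
I(O_t; Y \mid q, o_{<t})
  = H(Y \mid q, o_{<t})
  - \mathbb{E}_{o_t \sim \pi(\cdot \mid q, o_{<t})}
      \big[ H(Y \mid q, o_{<t}, o_t) \big]
```

Under this reading, a token position is informative when conditioning on its token lowers, on average, the remaining uncertainty about the answer. A position where the two entropy terms are nearly equal contributes little.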

Abstract

Large language models increasingly rely on long chains of thought to improve accuracy, yet such gains come with substantial inference-time costs. We revisit token-efficient post-training and argue that existing sequence-level reward-shaping methods offer limited control over how reasoning effort is allocated across tokens. To bridge this gap, we propose IAPO, an information-theoretic post-training framework that assigns token-wise advantages based on each token's conditional mutual information (MI) with the final answer. This yields an explicit, principled mechanism for identifying informative reasoning steps and suppressing low-utility exploration. We provide a theoretical analysis showing that IAPO can induce monotonic reductions in reasoning verbosity without harming correctness. Empirically, IAPO consistently improves reasoning accuracy while reducing reasoning length by up to 36%, outperforming existing token-efficient RL methods across various reasoning datasets. Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training. The code is available at https://github.com/YinhanHe123/IAPO.
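The idea of token-wise advantages driven by informativeness can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `alpha` weight, and the additive shaping rule are all hypothetical, and the paper's actual rule (which also includes an exploration adjustment) is not reproduced in this summary. The sketch assumes some estimator supplies the model's answer distribution before and after each reasoning token, and uses the per-token entropy drop as a pointwise proxy for conditional MI.

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def token_information_gains(answer_dists: List[Sequence[float]]) -> List[float]:
    """Entropy drop attributed to each token.

    answer_dists[0] is an assumed estimate of p(answer | prompt);
    answer_dists[t] is the estimate after the first t reasoning tokens.
    """
    return [
        entropy(answer_dists[t]) - entropy(answer_dists[t + 1])
        for t in range(len(answer_dists) - 1)
    ]


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style sequence-level advantages: rewards standardized within a group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def token_wise_advantages(seq_adv: float, gains: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Hypothetical shaping rule: shift the sequence-level advantage by the
    centered information gain of each token, scaled by alpha."""
    mean_gain = sum(gains) / len(gains)
    return [seq_adv + alpha * (g - mean_gain) for g in gains]


if __name__ == "__main__":
    # Toy completion with two reasoning tokens over a binary answer space.
    dists = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
    gains = token_information_gains(dists)
    seq_advs = group_normalized_advantages([1.0, 0.0])
    print(token_wise_advantages(seq_advs[0], gains, alpha=0.5))
```

In the toy example the first token leaves the answer distribution unchanged and the second sharpens it, so the second token receives the larger advantage. Centering the gains keeps the mean token advantage equal to the sequence-level one, which is a design choice of this sketch rather than a property claimed by the paper.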

Paper Structure

This paper contains 38 sections, 2 theorems, 29 equations, 28 figures, and 5 tables.

Introduction
Preliminaries and Problem Definition
Proposed Methodology
Overview
Information-Aware Advantage Shaping Module
Informativeness Level of a Token
Token Exploration Adjustments
Token-wise Advantage Assignment
Efficient Conditional MI Estimation Module
Early-exit-based Conditional MI Estimator
Training Acceleration Techniques
Theoretical Analysis
Completion Lengths Reduction
Exploration Adjustment
Empirical Study
...and 23 more sections

Key Result

Theorem 4.1

Given an LLM $\pi_0$, let $L_{\text{GRPO}}$ and $L_{\text{IAPO}}$ denote the expected completion lengths under $\pi_{\text{GRPO}}$ and $\pi_{\text{IAPO}}$, which are one-step updated policy models given by GRPO and IAPO upon the original policy $\pi_0$, respectively. For sufficiently small step size where $S(o)$ is the informativeness-weighted accumulated token-level gradient induced by the IAPO a

Figures (28)

Figure 1: Reasoning verbosity of RL post-trained LLMs. (a) Comparison of reasoning length between an LLM (DeepSeekR1-Distilled-Qwen-1.5B; DeepSeek-AI, 2025) and a human volunteer on math problems (Lightman et al., 2023). (b) Illustration of why the reasoning generated by the LLM is unnecessarily verbose.
Figure 2: Illustration of the information-aware advantage shaping module, where $s_{i,j}$ and $c_{i,j}$ are the token-wise advantages for the informativeness level and the exploration adjustment, respectively, of the $j$th token in the $i$th completion $o_i$ of the completion group $\{o_i\}_{i=1}^G$.
Figure 3: Illustration of the early-exit-based conditional MI estimator. We highlight "$o_{\textcolor{red}{<}t}$" and "$o_{\textcolor{red}{\leq}t}$" to emphasize the exclusion and inclusion, respectively, of the currently examined token $o_t$ in the partial completions.
Figure 4: Illustration of the naive implementation and the KV cache preloading technique in conditional MI estimation. We highlight the keys, queries, and values of the prompt postfix in red. The time complexities are shown on the right, where $K$ is the length of the prompt postfix, $N$ is $|o_i|$, $L$ is $|q|+|o_i|$, and $d$ is the embedding dimension of the LLM. The technique is significantly faster than the naive implementation because $K\ll L$.
Figure 5: Ablation Study for IAPO.
...and 23 more figures
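The speedup described in the Figure 4 caption can be illustrated with a rough operation count. This is a back-of-the-envelope sketch, not the complexities printed in the figure, which this summary does not reproduce. It assumes that the prompt postfix is a short text of length `k` appended to every partial completion to elicit an early answer, that only attention score computation is counted, and that the cached variant reuses keys and values for the prompt and the partial completion. Both function names are hypothetical.

```python
def naive_attention_cost(q_len: int, n: int, k: int, d: int) -> int:
    """Assumed naive scheme: for every prefix length t = 1..n, re-encode the
    whole sequence (prompt + t completion tokens + postfix) from scratch."""
    total = 0
    for t in range(1, n + 1):
        seq = q_len + t + k
        total += seq * seq * d  # every position attends over the full sequence
    return total


def cached_attention_cost(q_len: int, n: int, k: int, d: int) -> int:
    """Assumed cached scheme: keys/values of the prompt and partial completion
    are reused, so only the k postfix tokens act as queries for each prefix."""
    total = 0
    for t in range(1, n + 1):
        ctx = q_len + t + k
        total += k * ctx * d  # k queries attend over the cached context
    return total


if __name__ == "__main__":
    q_len, n, k, d = 100, 200, 8, 64
    naive = naive_attention_cost(q_len, n, k, d)
    cached = cached_attention_cost(q_len, n, k, d)
    print(f"naive / cached cost ratio: {naive / cached:.1f}")
```

Under these assumptions the per-prefix cost drops from quadratic in the sequence length to linear in it, scaled by the postfix length, which is consistent with the caption's remark that the gain comes from $K \ll L$.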

Theorems & Definitions (4)

Theorem 4.1
Corollary 4.2
proof
proof
