Reasoning as Compression: Unifying Budget Forcing via the Conditional Information Bottleneck

Fabio Valerio Massoli, Andrey Kuzmin, Arash Behboodi

TL;DR

The paper introduces a semantic prior that measures token cost by surprisal under a language model. The resulting CIB objective prunes cognitive bloat while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop.

Abstract

Chain-of-Thought (CoT) prompting improves LLM accuracy on complex tasks but often increases token usage and inference cost. Existing "Budget Forcing" methods, which reduce cost via fine-tuning with heuristic length penalties, suppress both essential reasoning and redundant filler. We recast efficient reasoning as a lossy compression problem under the Information Bottleneck (IB) principle, and identify a key theoretical gap when applying naive IB to transformers: attention violates the Markov property between prompt, reasoning trace, and response. To resolve this issue, we model CoT generation under the Conditional Information Bottleneck (CIB) principle, where the reasoning trace Z acts as a computational bridge that contains only the information about the response Y that is not directly accessible from the prompt X. This yields a general Reinforcement Learning objective: maximize task reward while compressing completions under a prior over reasoning traces, subsuming common heuristics (e.g., length penalties) as special cases (e.g., uniform priors). In contrast to naive token-counting-based approaches, we introduce a semantic prior that measures token cost by surprisal under a language model prior. Empirically, our CIB objective prunes cognitive bloat while preserving fluency and logic, improving accuracy at moderate compression and enabling aggressive compression with minimal accuracy drop.
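The idea of pricing tokens by surprisal rather than by count can be illustrated with a minimal sketch. It assumes the information cost of a trace is the summed negative log-probability of its tokens under the prior, and that the reward is the task reward minus a weighted cost. The function names `surprisal_cost` and `compressed_reward`, the weight `beta`, and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def surprisal_cost(prior_logprobs: Sequence[float]) -> float:
    """Total surprisal (in nats) of a reasoning trace under a prior.

    prior_logprobs[t] is the natural-log probability that a (hypothetical)
    prior model assigns to the t-th token of the trace.
    """
    return -sum(prior_logprobs)


def compressed_reward(task_reward: float,
                      prior_logprobs: Sequence[float],
                      beta: float) -> float:
    """Task reward minus a beta-weighted information cost (assumed form)."""
    return task_reward - beta * surprisal_cost(prior_logprobs)


# Two toy traces of equal length: a pure token count cannot tell them apart,
# while the surprisal cost assigns them different prices.
predictable = [math.log(0.9)] * 4
surprising = [math.log(0.1)] * 4

print(surprisal_cost(predictable))
print(surprisal_cost(surprising))
print(compressed_reward(1.0, predictable, beta=0.05))
```

In this sketch every additional token adds a non-negative cost, so longer traces are penalized, but the penalty per token depends on how the prior scores it rather than being constant.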

Paper Structure

This paper contains 37 sections, 2 theorems, 33 equations, 12 figures, and 3 tables.

Key Result

Proposition 4.1

A standard length-based penalty (e.g., $g(Z) = \alpha f(|Z|)$) is equivalent to the CIB objective under the assumption of a maximum entropy (uniform) prior, $Q$, over the vocabulary.
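A short sketch of why this can hold in the simplest case. It assumes that the information cost of a trace is its negative log-probability under $Q$, that $Q$ factorizes over tokens, and that $f$ is linear. The symbols $V$ (the vocabulary) and $z_t$ (the $t$-th token of $Z$) are notation introduced here for illustration, and the paper's own proof may be more general.

```latex
-\log Q(Z)
  = -\sum_{t=1}^{|Z|} \log Q(z_t)
  = -\sum_{t=1}^{|Z|} \log \frac{1}{|V|}
  = |Z| \log |V|
\quad\Longrightarrow\quad
\beta \bigl(-\log Q(Z)\bigr)
  = \underbrace{\beta \log |V|}_{\alpha} \, |Z|
  = \alpha f(|Z|), \qquad f(n) = n.
```

Under these assumptions every token costs the same $\log |V|$ nats, so the compression term reduces to a constant times the token count.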

Figures (12)

  • Figure 1: Pareto frontier for AIME24. The $\beta$ weight from the CIB objective confers fine-grained control over the accuracy-compression trade-off. A stronger prior ($Q_\phi=7B$, yellow square) allows for stronger compression than a smaller one ($Q_\phi=1.5B$, blue circles). As a reference, we report the baseline model (DLER [dler2025], red star), the L3L1-EXACT [aggarwal2025l1] model snapshot (purple cross), and our implementation of the L1-Exact length penalty from the same paper (green hexagon).
  • Figure 2: Minimality reward as a function of the completion length. We observe a consistent negative correlation between the completion length and the minimality reward used during RL training. The shaded blue region shows the $\pm1\sigma$ band, representing the spread of the information cost of the tokens chosen within CoTs of similar length.
  • Figure 3: Length Distribution. Compared with the baseline length distribution (blue curve), the minimality term shifts the length distribution towards shorter completions (green curve). The plotted distributions correspond to models with similar accuracy (within $\lesssim 1.4\%$; see Table \ref{tab:results}).
  • Figure 4: Meta-Generalization: Robustness Across Benchmarks. Efficiency gain of CIB across diverse benchmarks and models. Points falling in the upper half-plane ("Golden Zone") exhibit strictly superior efficiency, achieving higher information density with reduced computational cost.
  • Figure 5: Information Density Profile. Token-wise surprisal evaluated against the baseline language prior. Lower surprisal corresponds to predictable linguistic filler and cognitive bloat. CIB models maintain a consistently higher information floor ($\gtrsim0.2$ nats), confirming that the compression is semantic rather than arbitrary.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof