QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory

Yihang Wang, Xu Huang, Bowen Tian, Yueyang Su, Lei Yu, Huaming Liao, Yixing Fan, Jiafeng Guo, Xueqi Cheng

TL;DR

This work tackles the long-context bottleneck in in-context learning for large language models by formulating context compression through Information Bottleneck (IB) theory. It proposes QUITO-X, an IB-based framework that uses cross-attention as a practical proxy for mutual information to select the most query-relevant tokens under a fixed compression ratio, with lexical merging and Gaussian smoothing to preserve semantic integrity. The authors prove that maximizing the mutual information $I_Q(\bar{X}; Y)$ is equivalent to maximizing the conditional likelihood $\mathbb{E}[\log P(Y|\bar{X},Q)]$, and demonstrate state-of-the-art performance across nine long-context benchmarks, achieving up to a 25% improvement in compression rate while maintaining or even exceeding full-context accuracy in some cases. The method reduces memory usage and inference latency while delivering robust results on multi-doc QA, few-shot QA, and summarization tasks, highlighting its practical impact for scalable long-context reasoning. IB-based context compression thus offers a principled, efficient path to deploying powerful LLMs in scenarios with extremely long inputs.
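
The equivalence stated above can be seen in one standard way, sketched here under the assumption that $I_Q(\bar{X}; Y)$ denotes mutual information conditioned on the query $Q$ and that $H$ denotes conditional entropy. This is not necessarily the proof given in the paper.

```latex
\begin{aligned}
I_Q(\bar{X}; Y) &= H(Y \mid Q) - H(Y \mid \bar{X}, Q) \\
                &= H(Y \mid Q) + \mathbb{E}\left[\log P(Y \mid \bar{X}, Q)\right], \\
\arg\max_{\bar{X}} I_Q(\bar{X}; Y) &= \arg\max_{\bar{X}} \mathbb{E}\left[\log P(Y \mid \bar{X}, Q)\right].
\end{aligned}
```

The last line holds because $H(Y \mid Q)$ does not depend on how the compressed context $\bar{X}$ is chosen, so it acts as a constant in the maximization.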

Abstract

Generative LLMs have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) excessively long context leads to high costs and inference delays; (ii) a substantial amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or perplexity (PPL), which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce Information Bottleneck (IB) theory to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperforms the full context in some cases.
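
To make the selection step more concrete, the following is a minimal sketch of how per-token relevance scores could be smoothed, merged to word level, and used to keep a fixed fraction of words. It is an illustration under stated assumptions, not the authors' implementation. The function names (`gaussian_smooth`, `select_words`), the parameters (`keep_ratio`, `sigma`, `radius`, `word_ids`), the max-over-tokens merge rule, and the smooth-then-merge order are all hypothetical, and the scores are made-up numbers standing in for cross-attention weights.

```python
import math
from typing import List, Sequence


def gaussian_smooth(scores: Sequence[float], sigma: float = 1.0, radius: int = 2) -> List[float]:
    """Smooth scores with a truncated Gaussian kernel, renormalized at the edges."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    smoothed = []
    for i in range(len(scores)):
        total, weight = 0.0, 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(scores):  # ignore positions outside the sequence
                total += w * scores[j]
                weight += w
        smoothed.append(total / weight)
    return smoothed


def select_words(
    tokens: Sequence[str],
    scores: Sequence[float],
    word_ids: Sequence[int],
    keep_ratio: float = 0.5,
    sigma: float = 1.0,
) -> List[str]:
    """Keep the highest-scoring fraction of words, in their original order.

    `scores` stands in for per-token cross-attention weights (assumption).
    `word_ids[i]` is the index of the word that token i belongs to.
    """
    smoothed = gaussian_smooth(scores, sigma=sigma)

    # Lexical merge (assumed rule): a word's score is the max over its tokens.
    word_score, word_text = {}, {}
    for tok, s, w in zip(tokens, smoothed, word_ids):
        word_score[w] = max(word_score.get(w, float("-inf")), s)
        word_text[w] = word_text.get(w, "") + tok

    n_keep = max(1, math.ceil(keep_ratio * len(word_score)))
    top = sorted(word_score, key=lambda w: word_score[w], reverse=True)[:n_keep]
    return [word_text[w] for w in sorted(top)]  # restore original word order


# Usage with made-up scores: only the last token has a nonzero score,
# and smoothing spreads some of it onto its left neighbours.
tokens = ["The", "thief", "hid", "in", "a", "barn"]
scores = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(select_words(tokens, scores, word_ids=[0, 1, 2, 3, 4, 5], keep_ratio=0.5))
# ['in', 'a', 'barn']
```

In this toy run the smoothing is what pulls in the neighbouring words "in" and "a" alongside "barn", which illustrates why smoothing can help keep contiguous phrases instead of isolated high-scoring tokens.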

Paper Structure (38 sections, 22 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Comparison of our method and baseline approaches for preserving key information in model responses. Our method effectively retains critical context ("Thief"), ensuring accurate interpretation, while baseline methods fail to do so.
  • Figure 2: LLMLingua2 overly focuses on high-entropy nouns like 'barn' and 'farmhouse,' while neglecting relational words (e.g., 'near') and verbs, resulting in highly fragmented compression and leading to incorrect answers ('on a farm'). In contrast, QUITO-X retains key relational phrases ('in a barn near a farmhouse'), preserving full meaning and yielding the correct answer.
  • Figure 3: Overview of the proposed method for extracting cross-attention scores using a T5 model. The figure illustrates the process of filtering the context to retain the most relevant information for answering a specific query.
  • Figure 4: Ablation study results on four datasets (CoQA, Quoref, DROP, SQuAD) under three compression ratios (0.25, 0.5, 0.75). The top row shows the impact of the Gaussian filter on accuracy and information coverage, demonstrating consistent improvements across all datasets and compression ratios. The bottom row illustrates the effect of the merging module, highlighting its importance in recovering meaningful representations, particularly under higher compression ratios.
  • Figure 5: MRR results
  • ...and 5 more figures