QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory
Yihang Wang, Xu Huang, Bowen Tian, Yueyang Su, Lei Yu, Huaming Liao, Yixing Fan, Jiafeng Guo, Xueqi Cheng
TL;DR
This work tackles the long-context bottleneck in in-context learning for large language models by formulating context compression through Information Bottleneck (IB) theory. It proposes QUITO-X, an IB-based framework that uses cross-attention as a practical proxy for mutual information to select the most query-relevant tokens under a fixed compression ratio, with lexical merging and Gaussian smoothing to preserve semantic integrity. The authors prove that maximizing the mutual information $I_Q(\bar{X}; Y)$ is equivalent to maximizing the conditional likelihood $\mathbb{E}[\log P(Y|\bar{X},Q)]$, and demonstrate state-of-the-art performance across nine long-context benchmarks, achieving up to a 25% improvement in compression rate while maintaining or even exceeding full-context accuracy in some cases. The method reduces memory usage and inference latency while delivering robust results on multi-doc QA, few-shot QA, and summarization tasks, highlighting its practical impact for scalable long-context reasoning. IB-based context compression thus offers a principled, efficient path to deploy powerful LLMs in scenarios with extremely long inputs.
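The stated equivalence can be seen in one step. This is a sketch, assuming $I_Q(\bar{X}; Y)$ denotes the mutual information between the compressed context $\bar{X}$ and the answer $Y$ conditioned on the query $Q$ (the notation is not defined in this summary), and writing $H$ for Shannon entropy:

$$
I_Q(\bar{X}; Y) = H(Y \mid Q) - H(Y \mid \bar{X}, Q) = H(Y \mid Q) + \mathbb{E}[\log P(Y|\bar{X},Q)]
$$

The term $H(Y \mid Q)$ does not depend on how the context is compressed, so choosing $\bar{X}$ to maximize $I_Q(\bar{X}; Y)$ is the same as choosing it to maximize $\mathbb{E}[\log P(Y|\bar{X},Q)]$.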
Abstract
Generative LLMs have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, long contexts in complex tasks pose a significant barrier to wider adoption, in two main respects: (i) excessively long contexts lead to high costs and inference delays; (ii) the large amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or perplexity (PPL), which is inconsistent with the objective of retaining the tokens that matter most for a given query. In this study, we introduce Information Bottleneck (IB) theory to model the problem, offering a novel perspective that addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperforms the full context in some cases.
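To make the selection step concrete, here is a minimal Python sketch of query-aware token selection under a fixed budget. It is not the authors' implementation: the per-token `scores` are hypothetical stand-ins for cross-attention weights, and the function names, the max-over-subwords merging rule, the greedy budget fill, and the kernel radius are all assumptions made for illustration.

```python
import numpy as np

def gaussian_smooth(scores, sigma=1.0, radius=2):
    """Smooth per-token scores with a small Gaussian kernel (edge-padded)."""
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(scores, radius, mode="edge")
    # "valid" mode returns an array with the same length as `scores`.
    return np.convolve(padded, kernel, mode="valid")

def compress(tokens, word_ids, scores, keep_ratio=0.5, sigma=1.0):
    """Keep the highest-scoring whole words within a token budget.

    tokens   : list of subword strings
    word_ids : word index of each token (subwords of one word share an id)
    scores   : hypothetical relevance score per token (assumed to come
               from cross-attention between query and context)
    """
    scores = gaussian_smooth(np.asarray(scores, dtype=float), sigma)
    word_ids = np.asarray(word_ids)
    n_words = int(word_ids.max()) + 1

    # Assumed merging rule: a word's score is the max over its subwords.
    word_scores = np.full(n_words, -np.inf)
    np.maximum.at(word_scores, word_ids, scores)
    word_sizes = np.bincount(word_ids, minlength=n_words)

    budget = int(round(keep_ratio * len(tokens)))
    kept, used = set(), 0
    # Greedily add words by descending score while they fit the budget.
    for w in np.argsort(-word_scores, kind="stable"):
        if used + word_sizes[w] <= budget:
            kept.add(int(w))
            used += int(word_sizes[w])

    # Preserve the original token order in the compressed context.
    return [t for t, w in zip(tokens, word_ids) if int(w) in kept]

tokens = ["The", "cap", "ital", "of", "France", "is", "Paris", "."]
word_ids = [0, 1, 1, 2, 3, 4, 5, 6]
scores = [0.1, 0.2, 0.9, 0.1, 0.8, 0.1, 0.9, 0.1]
compressed = compress(tokens, word_ids, scores, keep_ratio=0.5)
```

Selecting at the word level keeps subwords such as "cap" and "ital" together, and the smoothing lets a high-scoring token raise the scores of its neighbours, which is one plausible way to read the "lexical merging and Gaussian smoothing" mentioned in the summary.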
