A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou
TL;DR
This work studies gist token-based context compression for long-context processing in Transformer-based LLMs, in which a long history is replaced by a small set of gist tokens. It introduces a unified taxonomy organized by memory location and gist granularity, and shows empirically that fine-grained KV compression achieves near-lossless performance on many tasks (e.g., retrieval-augmented generation and long-document QA) while leaving notable gaps on synthetic recall and reranking. Through probing, the authors identify three failure patterns (lost by the boundary, lost if surprise, and lost along the way) and propose two mitigation strategies: fine-grained autoencoding and segment-wise token importance estimation. Together these yield substantial performance gains, especially on challenging long-context tasks. The findings offer concrete guidance for deploying gist-based compression and point to future directions, including scaling to larger models and exploring broader compression mechanisms that extend context windows without full attention.
Abstract
In this work, we provide a thorough investigation of gist-based context compression methods to improve long-context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.
