Long Context In-Context Compression by Getting to the Gist of Gisting
Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov
TL;DR
This paper studies long-context compression in decoder-only transformers and finds that the original Gist approach, which compresses context into gist tokens through an attention bottleneck, does not scale to long contexts. A simple average pooling baseline outperforms Gist, which motivates GistPool: a method that combines shifted activations, separate parameters for compression, and a pooling-biased attention mask, while keeping Gist's simplicity. Theoretical and empirical analyses indicate that standard attention cannot reliably support copying or mean pooling over long sequences unless a restricted masking strategy is used. GistPool achieves near-lossless performance at low compression rates and strong gains at higher compression rates across multiple datasets and model scales. The results show that simple pooling-based strategies can rival or exceed learned compression methods, and that larger models benefit most from GistPool's inductive bias, which requires no change to the decoder architecture.
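To make the idea of a restricted, pooling-biased mask concrete, here is a minimal sketch in plain Python. It assumes each gist token may attend only to its own contiguous window of context tokens; that window assignment, the function name `pooling_mask`, and the `rate` parameter are illustrative assumptions and not details taken from the paper.

```python
from typing import List


def pooling_mask(num_context: int, rate: int) -> List[List[bool]]:
    """Build a hypothetical pooling-style attention mask.

    mask[g][t] is True when gist token g may attend to context token t.
    Assumption (not from the paper): gist token g sees only the window
    of `rate` consecutive context tokens starting at position g * rate.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    num_gist = -(-num_context // rate)  # ceiling division
    return [
        [g * rate <= t < (g + 1) * rate for t in range(num_context)]
        for g in range(num_gist)
    ]


# Six context tokens compressed at rate 3 give two gist tokens.
mask = pooling_mask(num_context=6, rate=3)
```

Under such a mask, uniform attention weights over the allowed positions reduce exactly to mean pooling of that window, which is one way to read the link between restricted masking and the pooling baseline.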
Abstract
Long-context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While gisting is effective for short instructions, we demonstrate that it struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations, and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting while significantly boosting its performance on long-context compression tasks.
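The average pooling baseline mentioned above can be sketched as follows. This is a minimal illustration that assumes activations are a list of equal-length vectors and that pooling uses non-overlapping windows of `rate` vectors; the function name `average_pool` and the handling of a shorter final window are assumptions, not the authors' implementation.

```python
from typing import List


def average_pool(activations: List[List[float]], rate: int) -> List[List[float]]:
    """Compress a sequence by averaging each window of `rate` vectors.

    A trailing window shorter than `rate` is averaged over the vectors
    it actually contains (an assumption made for this sketch).
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(activations), rate):
        window = activations[start:start + rate]
        dim = len(window[0])
        # Element-wise mean across the vectors in this window.
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Five 2-dimensional activation vectors compressed at rate 2.
context = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
compressed = average_pool(context, rate=2)
```

With `rate=2`, the five input vectors become three, so the downstream model attends to a shorter sequence at the cost of losing within-window detail.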
