Long Context In-Context Compression by Getting to the Gist of Gisting

Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov

TL;DR

This paper investigates long-context processing in decoder-only transformers and finds that the original Gist approach, which uses gist tokens and an attention bottleneck, fails to scale to long contexts. A simple average pooling baseline unexpectedly outperforms Gist, prompting the authors to propose GistPool, which combines three changes: shifting activations, separate compression parameters, and a pooling-biased attention mask. Together these preserve the simplicity of Gist while boosting long-context performance. Theoretical and empirical analyses show that standard attention cannot reliably support copying or mean pooling at long lengths unless a restricted masking strategy is used. GistPool provides a practical solution that achieves near-lossless performance at low compression and strong gains at higher compression across multiple datasets and model scales. The results suggest that simpler pooling-based strategies can rival or exceed learned compression methods, with larger models especially benefiting from GistPool’s inductive bias and architecture-consistent design, which may help long-context inference scale in practical LLM deployments.

Abstract

Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.

Paper Structure

This paper contains 40 sections, 12 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: In-context compression methods. We illustrate compressing a story with compression rate $\xi{=}2$, using a 2-layer transformer. At inference time, the model first compresses the context and then autoregressively samples an answer based on the compressed context and the query. Both parts are used for training. The color of the activations and the mask rows (yellow/green) correspond to different sets of model parameters. All tokens also attend to the BOS token, which is not shown. a) The Full context baseline, i.e., finetuning the base model. b) The No context baseline, simulating the worst-case performance when the context is fully destroyed by the compression. c) The original Gist setup, where the added gist tokens can attend to the context but the query and answer tokens can only attend to the gist tokens. d) AvgPool, where we average pool the model activations every two tokens and use the result for the prediction stage. For presentation purposes, we illustrate the pooled values as extra tokens. e) OffsetGist, a variant of Gist where the compressed activations at the gist positions are shifted one layer down prior to prediction to make the compressed activations immediately available to the next layer. f) SepOffsetGist, which is equivalent to OffsetGist, except the compressed activations are computed with a separate set of model parameters. g) GistPool, our proposed in-context compression method. The key features of GistPool are: (1) shifting the activations down by one layer during the prediction phase; (2) separating the compression-specific parameters from the other model parameters; (3) spreading out the tokens uniformly across the context and modifying the mask. For illustration purposes, a mask attending to the previous two pooling windows is shown, but the experiments are performed with a mask attending to the previous 5 windows (see \ref{sec:spreading_the_tokens} for details).
  • Figure 2: Final evaluation loss and Gemini Score (number of errors as determined by Gemini Judge). Full context baseline (lower) and No context baseline (higher) are indicated with dashed black lines (shaded between). If the No context baseline is significantly worse than other methods, we omit it for clarity and instead show its value. Lower values are better for both metrics.
  • Figure 3: The gist tokens delay the information flow. a) In the base model, the activations of layer $i$ are the query position inputs of layer $i+1$. b) The summaries introduced with Gist become the activations at the gist positions at layer $i+1$, which in turn become the query position inputs at layer $i+2$, one layer later than the model expects the information from layer $i$. c) By shifting the gist activations one layer down for the prediction stage, the summarized context from layer $i$ is available as input to the query positions at layer $i+1$, matching the expectation of the base model.
  • Figure 4: Attention might learn average pooling for a fixed context size with the standard mask but requires the pool mask in the variable context size case. Shown are the learned attention weights for mean pooling Gemma embeddings, with context length 256 and compression rate 8. Only the attention weights for the gist positions attending to the context positions are shown. The pool mask forces each gist token to attend to its corresponding group. With the standard mask, however, attention cannot learn average pooling in the variable context size case, as seen by the dispersed attention in c).
  • Figure 5: Comparison of desired vs. actual context compression rates using Gemini Compress. We do not have a direct way to control the length of the generated summaries, hence they have a distribution of lengths rather than fixed lengths. Nevertheless, the mean compression rates are close to the target compression rates.
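
The relationship between the AvgPool baseline (Figure 1d) and the pool mask (Figures 1g and 4) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `avg_pool` and `pool_mask`, the `windows` parameter, and the choice to count a gist token's own window among the "previous" windows are assumptions made for illustration, and the sketch assumes the context length is a multiple of the compression rate. It shows that uniform attention restricted to each gist token's own group reproduces mean pooling.

```python
import numpy as np


def avg_pool(acts: np.ndarray, rate: int) -> np.ndarray:
    """Mean-pool activations of shape (n_ctx, d) over consecutive windows of `rate` tokens."""
    n_ctx, d = acts.shape
    if n_ctx % rate != 0:
        raise ValueError("this sketch assumes n_ctx is a multiple of rate")
    return acts.reshape(n_ctx // rate, rate, d).mean(axis=1)


def pool_mask(n_ctx: int, rate: int, windows: int = 1) -> np.ndarray:
    """Boolean (n_gist, n_ctx) mask for gist-to-context attention.

    Gist token g may attend to its own pooling window and the `windows - 1`
    windows before it (assumed reading of "previous k windows").
    """
    if n_ctx % rate != 0:
        raise ValueError("this sketch assumes n_ctx is a multiple of rate")
    n_gist = n_ctx // rate
    window_of = np.arange(n_ctx) // rate   # pooling window of each context position
    g = np.arange(n_gist)[:, None]         # one row per gist token
    return (window_of[None, :] <= g) & (window_of[None, :] > g - windows)


# Toy example: 8 context tokens of width 2, compression rate 2.
acts = np.arange(16, dtype=float).reshape(8, 2)
mask = pool_mask(n_ctx=8, rate=2)

# Uniform attention weights inside each allowed window.
weights = mask / mask.sum(axis=1, keepdims=True)
pooled = avg_pool(acts, rate=2)

# With a single-window pool mask, uniform attention equals average pooling.
assert np.allclose(weights @ acts, pooled)
```

Setting `windows=5` would give a mask in the spirit of the one the Figure 1 caption reports for the experiments, under the same assumed reading. In the actual method the attention weights are learned rather than fixed to be uniform.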