Sentence-Anchored Gist Compression for Long-Context LLMs

Dmitrii Tarasov, Elizaveta Goncharova, Kuznetsov Andrey

TL;DR

The paper addresses the memory and compute bottleneck of long-context processing in LLMs by learning sentence-anchored gist tokens that compress preceding context into a compact representation. It extends the LM with $N_g$ gist tokens, inserts them at sentence boundaries, and modifies the attention mask so that gist tokens aggregate information from their segment and stay visible to later tokens, while regular tokens attend only locally. Training proceeds in three stages with an end-to-end language-modeling objective and no reconstruction loss, achieving $2\times$ to $8\times$ KV-cache compression with minimal degradation on short- and long-context benchmarks, including in experiments with a 3B LLaMA model. The approach needs only modest architectural changes and performs competitively with existing compression methods.
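To make the insertion step concrete, here is a minimal text-level sketch of placing $N_g$ gist tokens at each sentence boundary. The placeholder names `<gist_i>`, the function `insert_gist_tokens`, and the regex-based sentence splitter are all illustrative assumptions; the summary does not say how the paper detects sentence boundaries or how the gist tokens are named.

```python
import re


def insert_gist_tokens(text: str, n_gist: int = 2) -> str:
    """Append n_gist placeholder gist tokens after every sentence.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    This naive rule is for illustration only.
    """
    # Hypothetical token names; a real setup would add these to the tokenizer.
    gist = "".join(f"<gist_{i}>" for i in range(n_gist))
    # Split after sentence-final punctuation, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(sentence + gist for sentence in sentences)


if __name__ == "__main__":
    print(insert_gist_tokens("Cats sleep. Dogs bark!", n_gist=2))
```

In a real pipeline the placeholders would be mapped to new token ids with trainable embeddings; that step is omitted here.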

Abstract

This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2x to 8x without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.

Paper Structure

This paper contains 27 sections, 2 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Comparison of attention mechanisms: (a) causal attention, and (b) sentence attention with $N_g=1$ gist token. Gist tokens ($g_1$) are inserted at sentence boundaries and pool information from their entire sentence. They are visible to all subsequent tokens, while regular tokens ($t_i$) only attend within their sentence.
  • Figure 2: PG19 perplexity for Sentence Llama3.2-3B ($N_g \in \{4,8\}$) compared to the base model across different prefix lengths. The "no Gist Tokens" curves represent perplexity calculated while excluding all but the final gist token in each segment.
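
The sentence attention pattern described in the Figure 1 caption can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation. It assumes gist tokens are placed after the regular tokens of their sentence and that a query may see a key when the key is not in the future and is either in the same sentence or a gist token. The function names and the pure-Python list representation are choices made for this sketch.

```python
from typing import List, Tuple


def build_layout(sentence_lengths: List[int], n_gist: int) -> List[Tuple[int, bool]]:
    """Return (sentence_id, is_gist) per position.

    Assumption: each sentence's regular tokens are followed by its gist tokens.
    """
    layout: List[Tuple[int, bool]] = []
    for sid, length in enumerate(sentence_lengths):
        layout.extend((sid, False) for _ in range(length))  # regular tokens t_i
        layout.extend((sid, True) for _ in range(n_gist))   # gist tokens g_j
    return layout


def sentence_attention_mask(sentence_lengths: List[int], n_gist: int = 1) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    layout = build_layout(sentence_lengths, n_gist)
    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_sid, _ = layout[q]
        for k in range(q + 1):  # causal: never attend to future positions
            k_sid, k_is_gist = layout[k]
            # Visible if in the same sentence, or if the key is an earlier gist token.
            mask[q][k] = (k_sid == q_sid) or k_is_gist
    return mask


if __name__ == "__main__":
    # Two sentences of 2 and 3 tokens, one gist token each.
    for row in sentence_attention_mask([2, 3], n_gist=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this mask, a token in a later sentence reaches earlier sentences only through their gist tokens, so the keys and values of earlier regular tokens are never read again.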