Dodo: Dynamic Contextual Compression for Decoder-only LMs

Guanghui Qin, Corby Rosset, Ethan C. Chau, Nikhil Rao, Benjamin Van Durme

TL;DR

Transformer models struggle with long contexts due to quadratic self-attention cost. Dodo introduces dynamic contextual compression, representing text with a variable number of nuggets per layer to dramatically reduce decoding overhead. It supports both autoregressive language modeling and fixed-context compression, with a learning framework based on a straight-through estimator and LoRA-based fine-tuning. Experiments show Dodo retains language modeling, QA, and summarization capabilities at up to 20x compression, and often matches or exceeds baselines under compression, suggesting practical pathways to longer-context LLMs. The approach is complementary to existing long-context strategies and highlights clausal delimiters as a prominent nugget type.

Abstract

Transformer-based language models (LMs) are inefficient in long contexts. We propose Dodo, a solution for context compression. Instead of one vector per token in a standard transformer model, Dodo represents text with a dynamic number of hidden states at each layer, reducing the cost of self-attention to a fraction of typical time and space. Moreover, off-the-shelf models such as LLaMA can be adapted to Dodo by efficient parameter tuning methods such as LoRA. In use, Dodo can act as either an autoregressive LM or a context compressor for downstream tasks. We demonstrate through experiments in language modeling, question answering, and summarization that Dodo retains capabilities in these tasks, while drastically reducing the overhead during decoding. For example, in the autoencoding task, Dodo shrinks context at a 20x compression ratio with a BLEU score of 98% for reconstruction, achieving nearly lossless encoding.

Paper Structure (44 sections, 16 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: Dodo efficiently maps long inputs into a compressed set of vectors named nuggets, which can then be attended to when processing a query.
  • Figure 2: An illustration of the autoregressive Dodo, where $\mathtt{Scorer}(\varphi)$ selects nugget tokens and $\mathtt{Dodo}(\phi)$ aggregates the information of the $(t-\tau)$ distant tokens into nuggets. When predicting a new token, $\mathtt{LM}(\theta)$ has direct access to the most recent $\tau$ tokens but must use nuggets to access distant information.
  • Figure 3: Dodo as a context compressor. From left to right, Encoder side: $\mathtt{Dodo}_\phi$ encodes text into vector representations; Scorer: $\mathtt{Scorer}_\varphi$ computes a score for each encoder token and then selects the top-$k$ tokens as nuggets; Decoder side: the language model $\mathtt{LM}_\theta$ autoregressively decodes text conditioned on nuggets.
  • Figure 4: BLEU scores for autoencoding. Each group corresponds to a sequence length ($\pm 5$ tokens). Note the performance of ICAE is nearly 100% for sequence lengths shorter than 300.
  • Figure 5: Frequency of the token types selected by Dodo versus their frequency in the formal texts. These top 10 token types cover 95% of the observed selection.
  • ...and 2 more figures
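
The selection step described in the captions for Figures 2 and 3 can be illustrated with a small sketch. This is not the authors' implementation: the function names `select_nuggets` and `visible_positions`, the rule `k = ceil(n / ratio)` for turning a compression ratio into a nugget count, and the hand-written scores are all assumptions made for illustration. In the paper the scores come from a learned scorer, and the nuggets are hidden states rather than token indices.

```python
import math


def select_nuggets(scores, ratio):
    """Return the indices of the top-k scored tokens, in original order.

    Assumption: k = ceil(n / ratio), so a ratio of 20 keeps roughly
    one token in twenty. The real scorer is a learned module.
    """
    n = len(scores)
    if n == 0:
        return []
    k = max(1, math.ceil(n / ratio))
    # Rank positions by score, highest first, then restore text order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def visible_positions(scores, ratio, tau):
    """Positions a new token may attend to in the autoregressive setting.

    The most recent `tau` tokens are kept in full; older ("distant")
    tokens are reachable only through the selected nuggets.
    """
    n = len(scores)
    split = max(0, n - tau)
    distant_nuggets = select_nuggets(scores[:split], ratio)
    recent = list(range(split, n))
    return distant_nuggets + recent


# Hypothetical scores for a 10-token context.
scores = [0.1, 0.9, 0.2, 0.05, 0.8, 0.3, 0.7, 0.15, 0.6, 0.25]
print(select_nuggets(scores, ratio=5))            # 10 tokens, 2 nuggets
print(visible_positions(scores, ratio=4, tau=2))  # nuggets + last 2 tokens
```

With a compression ratio of 5, the ten-token context is reduced to two attended positions; in the autoregressive variant the attended set is the selected distant positions followed by the uncompressed recent window.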