Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati

TL;DR

This paper tackles long-context processing beyond the pre-trained context window of transformers. It introduces REFORM, a two-phase inference pipeline: recurrent chunked forwarding with a compressed KV cache, followed by on-demand cache recomputation over tokens gathered by similarity matching. The design aims for high retrieval fidelity at reduced resource cost. Empirically, REFORM delivers substantial gains on long-context benchmarks (over 52% on RULER and 34% on BABILong at 1M tokens) and outperforms baselines on ∞-bench, RepoEval, and MM-NIAH, while reducing inference time and memory relative to competing methods. The approach is modality-agnostic and applies across domains, and the paper supports it with ablations and efficiency measurements, including $O(L)$ time for recurrent processing and memory proportional to the stored token embeddings.

Abstract

As large language models gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random-access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and uses an early-exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves performance gains of over 52% and 34% on RULER and BABILong, respectively, at a 1M-token context length. It also outperforms baselines on Infinite-Bench, RepoEval, and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
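The second phase, gathering essential tokens by similarity matching, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the max-over-query-tokens scoring rule, and the fixed top-k budget are all assumptions made for illustration, and the toy 2-dimensional vectors stand in for the cross-layer context embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gather_tokens(context_emb, query_emb, k):
    """Return positions of the k context tokens most similar to any query token.

    Assumption (hypothetical scoring rule): a token's score is its maximum
    cosine similarity over the query tokens. Positions are returned in their
    original order so the gathered tokens can be re-forwarded as a shorter
    sequence when the KV cache is recomputed.
    """
    scores = [max(cosine(c, q) for q in query_emb) for c in context_emb]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy embeddings: four context tokens and one query token.
context = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [[1.0, 0.1]]
selected = gather_tokens(context, query, k=2)
print(selected)  # [0, 2]
```

Returning the positions in sorted order keeps the gathered tokens in their original relative order, which matters if positional information is reassigned during recomputation.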

Paper Structure

This paper contains 32 sections, 2 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: An overview of the proposed framework. REFORM efficiently processes long inputs in two phases. In the recurrent chunked forwarding phase, it segments the input into chunks and processes them iteratively. In each iteration, REFORM (1) forwards the chunk conditioned on the previous KV cache, (2) extracts key QKV states from selected layers and heads to construct cross-layer context embeddings, and (3) compresses the cache via token eviction (H2O; Zhang et al., 2023). An early-exit strategy skips the upper layers beyond those used for embedding collection, further improving efficiency. In the on-demand cache recomputation phase, REFORM selects important tokens via similarity search against the query embeddings (taken from the last part of the input), gathers them, and recomputes the KV cache for further generation.
  • Figure 2: MNR Scores for Value Heads. The distribution of MNR scores (lower is better) across the value states of different attention heads, measured with the Mistral-Nemo-Instruct-2407 model on 500 synthetic multi-hop QA examples. A 256-token heavy-hitter budget was used for the computation.
  • Figure 3: Needle-In-A-Haystack Evaluation. We visualize the retrieval accuracy of Qwen2.5-7B-Instruct at different depths and context lengths. Performance is averaged over 20 samples.
  • Figure 4: MNR Scores for Value Heads. The distribution of MNR scores (lower is better) across the value states of different attention heads, measured with the Mistral-Nemo-Instruct-2407 model on 500 synthetic pattern-matching examples. Recurrent chunked forwarding with a 256-token heavy-hitter budget was used to compute the embeddings.
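The recurrent chunked forwarding phase described in Figure 1 keeps the cache bounded by evicting tokens after each chunk. The toy sketch below shows only that bookkeeping, under loud assumptions: the function name is hypothetical, no model is run, and the per-token importance scores are supplied as inputs, whereas a heavy-hitter scheme would derive them from accumulated attention. It illustrates why the peak cache size depends on the budget and chunk size rather than on the input length.

```python
def chunked_forward_with_eviction(scores, chunk_size, budget):
    """Toy model of recurrent chunked processing with a bounded cache.

    scores[i] stands in for the importance of token i (an assumption; a
    heavy-hitter scheme would compute it from attention). Returns the
    surviving token positions and the peak cache size observed.
    """
    cache = []  # positions of tokens currently kept in the cache
    peak = 0
    for start in range(0, len(scores), chunk_size):
        chunk = list(range(start, min(start + chunk_size, len(scores))))
        cache.extend(chunk)  # "forward" the chunk on top of the existing cache
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Evict the lowest-scoring tokens, then restore positional order.
            keep = sorted(cache, key=lambda i: scores[i], reverse=True)[:budget]
            cache = sorted(keep)
    return cache, peak

scores = [0.9, 0.1, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
kept, peak = chunked_forward_with_eviction(scores, chunk_size=2, budget=3)
print(kept, peak)  # [0, 3, 5] 5
```

In this sketch the cache never holds more than `budget + chunk_size` entries, however long the input grows.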