Radar: Fast Long-Context Decoding for Any Transformer

Yongchang Hao, Mengyao Zhai, Hossein Hajimirsadeghi, Sepidehsadat Hosseini, Frederick Tung

TL;DR

Radar is a training-free approach that accelerates Transformer inference by dynamically searching for the most important context tokens, offering a practical solution for efficient long-context processing.

Abstract

Transformer models have demonstrated exceptional performance across a wide range of applications. Although dot-product attention forms the foundation of Transformer models, it does not scale well to long-context data because its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with previous methods on a wide range of tasks. The results demonstrate that Radar achieves state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.
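
The quadratic growth mentioned in the abstract can be illustrated with a small counting sketch. It assumes standard autoregressive decoding with a key-value cache, where the token at step t attends to all t cached keys, so the total number of query-key dot products is T(T+1)/2. The function names and the fixed per-step `budget` are hypothetical and serve only as a point of comparison; they are not Radar's algorithm or its stated complexity.

```python
def full_attention_cost(num_tokens: int) -> int:
    """Query-key dot products for decoding num_tokens tokens with full attention.

    Step t (1-indexed) attends to t cached keys, so the total is
    1 + 2 + ... + num_tokens = num_tokens * (num_tokens + 1) / 2.
    """
    return sum(range(1, num_tokens + 1))


def budgeted_attention_cost(num_tokens: int, budget: int) -> int:
    """Same count when each step attends to at most `budget` keys.

    `budget` is a hypothetical cap used only for comparison; it does not
    describe how Radar selects tokens.
    """
    return sum(min(t, budget) for t in range(1, num_tokens + 1))


if __name__ == "__main__":
    for length in (1_000, 2_000, 4_000):
        full = full_attention_cost(length)
        capped = budgeted_attention_cost(length, budget=100)
        print(f"T={length}: full={full}, capped={capped}")
```

Under these assumptions, doubling the context length roughly quadruples the full-attention count, while the capped count grows roughly linearly once the context exceeds the cap.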

Paper Structure

This paper contains 41 sections, 8 theorems, 24 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Lemma 0

Let ${\bm{\phi}}_{\bm{\Omega}}$ follow the definition eq:random-feature where ${\bm{\Omega}} = ({\bm\omega}_1, \dots, {\bm\omega}_n)$ is sampled from ${\mathcal{N}}_{\bm{0}, \bm{1}}$. Given any ${\bm{u}}, {\bm{v}} \in {\mathbb{R}}^d$, we have $\mathop{\mathbb{E}}_{\bm{\Omega}}[{\bm{\phi}}^\top_{\bm{

Figures (7)

  • Figure 1: Overview of the approach.
  • Figure 2: The performance comparison in perplexity (first row) and elapsed time (second row). The lower the better for both metrics. For the Llama model, we annotate the perplexity value at the last token; for the Mistral model, we annotate the perplexity at the maximum pre-training context length (shown by the vertical dashed lines) because the full context exceeds its modeling ability. We additionally show the generation throughput for all runs.
  • Figure 3: Generation without prompts.
  • Figure 4: The effect of the hyper-parameters $n$ (the projection dimension for the random matrix) and $k$ (the number of top segments selected) introduced by Radar.
  • Figure 5: Ablation studies. Here, we compare with three different segment selection strategies.
  • ...and 2 more figures

Theorems & Definitions (18)

  • Lemma 0
  • proof
  • Theorem 1
  • proof
  • Lemma 2: adapted from fan2015exponential
  • proof
  • Lemma 3
  • proof
  • Definition 4
  • Lemma 5
  • ...and 8 more