CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling

Jun Zhang, Shuyang Jiang, Jiangtao Feng, Lin Zheng, Lingpeng Kong

TL;DR

CAB introduces a fine-grained attention taxonomy with four patterns (noncausal self, NS; causal self, CS; noncausal cross, NC; causal cross, CC) and a Comprehensive Attention Benchmark spanning seven real-world tasks and eight backbone networks to evaluate efficient attention architectures. It shows that current efficient attentions often match vanilla attention under the noncausal self pattern but struggle under the cross and causal patterns, which exposes pattern-specific limitations and the need for cross-pattern generalization. The framework defines a Compositional Index to unify metrics across tasks, analyzes efficiency length to quantify practical gains over vanilla attention, and investigates interpolation versus extrapolation in long-context language modeling, offering insights into when and how efficient attentions scale. Overall, CAB provides a pattern-aware, cross-domain evaluation that guides the design of next-generation attention mechanisms for long-sequence modeling and long-context generation.
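To make the idea of unifying heterogeneous metrics concrete, here is a minimal sketch of one plausible normalization. It assumes per-task z-score standardization followed by averaging across tasks; this summary does not spell out the exact definition of the Compositional Index, so treat the formula, the function names (`standardize`, `composite_index`), and the toy numbers as hypothetical.

```python
from statistics import mean, pstdev


def standardize(scores, higher_is_better=True):
    """Z-score one task's raw scores (model name -> score).

    Assumption: metrics where lower is better (e.g. perplexity) are
    sign-flipped so that a larger standardized value is always better.
    """
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # All models tie on this task; no model gets an advantage.
        return {name: 0.0 for name in scores}
    sign = 1.0 if higher_is_better else -1.0
    return {name: sign * (s - mu) / sigma for name, s in scores.items()}


def composite_index(standardized_tasks):
    """Average each model's standardized scores over all tasks."""
    models = list(standardized_tasks[0].keys())
    return {m: mean(task[m] for task in standardized_tasks) for m in models}


# Hypothetical raw scores for two made-up models on two made-up tasks.
accuracy = {"model_a": 1.0, "model_b": 3.0}       # higher is better
perplexity = {"model_a": 10.0, "model_b": 30.0}   # lower is better

tasks = [
    standardize(accuracy, higher_is_better=True),
    standardize(perplexity, higher_is_better=False),
]
index = composite_index(tasks)
```

In this toy case each model wins one task by the same standardized margin, so both end up with the same index. The point of the standardization step is that an accuracy gap and a perplexity gap become comparable before averaging.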

Abstract

The Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve the Transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely used benchmark for testing these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA focuses only on the standard bidirectional (or noncausal) self attention and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. In this paper, we propose the Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performance of nine widely used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
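The four patterns in the taxonomy differ along two axes: whether queries and context come from the same sequence (self versus cross) and whether future positions are hidden (causal versus noncausal). The sketch below illustrates this with vanilla softmax attention in plain Python. The function names `attention_mask` and `masked_attention` are illustrative and not from the paper, and the causal rule used for cross attention (query i sees context positions j <= i) is an assumption made for the example.

```python
import math


def attention_mask(n_query, n_context, causal):
    """Visibility mask: mask[i][j] is True if query i may attend to context j.

    Assumption: under a causal pattern, query i sees context positions j <= i.
    """
    return [
        [(not causal) or j <= i for j in range(n_context)]
        for i in range(n_query)
    ]


def masked_attention(queries, keys, values, mask):
    """Vanilla softmax attention restricted to the visible positions."""
    d = len(queries[0])
    outputs = []
    for q, row in zip(queries, mask):
        visible = [j for j, ok in enumerate(row) if ok]
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in visible
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append([
            sum(e / total * values[j][c] for e, j in zip(exps, visible))
            for c in range(len(values[0]))
        ])
    return outputs


# Pattern name -> (is_self_attention, is_causal)
PATTERNS = {
    "NS": (True, False),   # noncausal self
    "CS": (True, True),    # causal self
    "NC": (False, False),  # noncausal cross
    "CC": (False, True),   # causal cross
}

# Toy inputs: a 3-token target sequence and a 4-token source sequence.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
source = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

results = {}
for name, (is_self, causal) in PATTERNS.items():
    context = target if is_self else source
    mask = attention_mask(len(target), len(context), causal)
    results[name] = masked_attention(target, context, context, mask)
```

Under the causal patterns the first query can see only the first context position, so its output is exactly that position's value vector. An efficient attention has to reproduce this visibility constraint without materializing the full mask, which is one reason a method designed for the noncausal self pattern may not transfer directly to the others.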

Paper Structure (54 sections, 19 equations, 8 figures, 15 tables)

Figures (8)

  • Figure 1: Computation diagrams for (a) noncausal self, (b) causal self, (c) noncausal cross, (d) causal cross attentions. Shaded blocks represent future tokens that are invisible to the current state. Blocks with red rims represent the current state token.
  • Figure 2: Empirical running time (a) and memory cost (b) as a function of sequence length, reported relative to vanilla attention. (c) Efficiency length of attention architectures compared to vanilla attention. The efficient attentions on the y-axis are sorted monotonically by efficiency length on memory usage.
  • Figure 3: (a) Pairwise attention pattern correlation, where NS, CS, NC, and CC signify noncausal self, causal self, noncausal cross, and causal cross attention respectively; (b) Ablation study of removing attention. The gray part is the score achieved by the existing efficient attention family, where we select the best-performing efficient attention for each task; (c) Performance of efficient attentions under different context lengths in the LM task during the test phase. S4D fails on the context with 16,384 tokens.
  • Figure 4: Task-to-task correlation.
  • Figure 5: Model parameters on noncausal self pattern. "FS2" denotes FastSpeech2 and "Tr" denotes Transformer for brevity.
  • ...and 3 more figures