Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention

Ziwei He, Jian Yuan, Le Zhou, Jingwen Leng, Bo Jiang

TL;DR

Self-attention in Transformers typically scales as $O(N^2)$ with sequence length $N$, which hinders long-context processing. The paper introduces Fovea Transformer, which builds a bottom-up multi-scale representation tree and uses a fovea attention mechanism that attends to progressively coarser context as distance from the query grows, achieving $O(N \log N)$ attention complexity. It requires no additional parameters and can be used as a drop-in replacement for standard attention. The model is warm-started from LongT5 and evaluated on three long-context abstractive summarization datasets, achieving state-of-the-art results on two of them. The approach gives smoother granularity transitions and practical efficiency for long-document summarization with pretrained Transformers.

Abstract

The quadratic complexity of self-attention in Transformers has hindered the processing of long text. To alleviate this problem, previous works have proposed to sparsify the attention matrix, taking advantage of the observation that crucial information about a token can be derived from its neighbors. These methods typically combine one or another form of local attention and global attention. Such combinations introduce abrupt changes in contextual granularity when going from local to global, which may be undesirable. We believe that a smoother transition could potentially enhance the model's ability to capture long-context dependencies. In this study, we introduce Fovea Transformer, a long-context-focused Transformer that addresses the challenges of capturing global dependencies while maintaining computational efficiency. To achieve this, we construct a multi-scale tree from the input sequence and use representations of context tokens at progressively coarser granularity in the tree as their distance to the query token increases. We evaluate our model on three long-context summarization tasks (our code is publicly available at https://github.com/ZiweiHe/Fovea-Transformer). It achieves state-of-the-art performance on two of them and competitive results on the third, where some evaluation metrics improve and others regress.
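
The fine-to-coarse idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling over adjacent pairs, the `window` parameter, and the names `build_tree` and `fovea_context` are all assumptions made for illustration, and the sketch works on single tokens rather than blocks. It builds a binary tree bottom-up and, for one query position, picks the tree nodes that cover the whole sequence exactly once, using fine nodes near the query and coarse nodes far from it.

```python
from typing import List, Tuple

Vector = List[float]


def build_tree(tokens: List[Vector]) -> List[List[Vector]]:
    """Build a bottom-up multi-scale tree.

    Level 0 holds the input vectors. Each higher level averages adjacent
    pairs of the level below (mean pooling is an assumption of this sketch).
    """
    levels = [tokens]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        merged = []
        for i in range(0, len(prev), 2):
            pair = prev[i:i + 2]  # the last node may have a single child
            dim = len(pair[0])
            merged.append([sum(v[d] for v in pair) / len(pair) for d in range(dim)])
        levels.append(merged)
    return levels


def fovea_context(num_tokens: int, query: int, window: int = 1) -> List[Tuple[int, int]]:
    """Return (level, index) nodes that one query attends to.

    Starting from the root, a node is split into its children while it lies
    within `window` nodes of the query's ancestor at that level; otherwise it
    is kept as a single coarse key. `window` is a hypothetical knob.
    """
    sizes = [num_tokens]  # number of nodes at each level
    while sizes[-1] > 1:
        sizes.append((sizes[-1] + 1) // 2)

    selected = []
    stack = [(len(sizes) - 1, 0)]  # start at the root
    while stack:
        level, idx = stack.pop()
        near = abs(idx - (query >> level)) <= window
        if level > 0 and near:
            for child in (2 * idx + 1, 2 * idx):
                if child < sizes[level - 1]:
                    stack.append((level - 1, child))
        else:
            selected.append((level, idx))
    # Order nodes by the first token position they cover.
    return sorted(selected, key=lambda node: node[1] << node[0])


if __name__ == "__main__":
    tree = build_tree([[float(i)] for i in range(16)])
    print("nodes per level:", [len(level) for level in tree])
    print("context for query 0:", fovea_context(16, query=0))
    print("keys for query 512 of 1024:", len(fovea_context(1024, query=512)))
```

In this sketch, at most a constant number of nodes (depending on `window`) is kept at each of the roughly $\log_2 N$ levels, so each query attends to $O(\log N)$ keys, which is consistent with the $O(N \log N)$ total stated above.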

Paper Structure (10 sections, 5 equations, 3 figures, 4 tables)


Figures (3)

  • Figure 1: Illustration for tree construction and fovea attention.
  • Figure 2: Examples of building blocks for fovea attention. Each subplot shows the attention mask between queries and keys at one level of the tree (the colors correspond). If the input originally contains N blocks, the number of blocks decreases at each higher level as the tree merges them. Colored entries indicate active attention; white entries indicate no attention.
  • Figure 3: Training speed and GPU memory consumption of Fovea Transformer, LongT5, and T5. All models are of large size, with input lengths of 1k, 2k, 4k, and 8k. Measurements were taken with batch size 1 on 1$\times$4 A100-40 GPUs.