Long-Short Transformer: Efficient Transformers for Language and Vision

Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

TL;DR

The paper tackles the quadratic complexity of self-attention in transformers when processing long sequences by introducing the Long-Short Transformer (Transformer-LS), which combines a dynamic long-range attention with segment-wise short-term window attention to achieve linear time and memory. A DualLN normalization is proposed to align the scales of the two attention components, enabling effective fusion for both autoregressive and bidirectional modeling. The approach yields state-of-the-art or competitive results across language and vision tasks, including Long Range Arena, enwik8/text8 language modeling, and ImageNet classification, while substantially reducing computational requirements. The method demonstrates strong cross-domain generalization and scalability to high-resolution inputs, with released code and models for broader use.
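To make the quadratic-versus-linear contrast concrete, the following minimal sketch counts query-key scores per attention head. It assumes each query attends to at most 2w local tokens and r projected tokens, where w is the window size and r the projection rank used in the paper's figure captions. The function names and the values w=128, r=32 are hypothetical, and the count ignores the cost of computing the projection itself.

```python
def full_attention_scores(n: int) -> int:
    """Query-key scores computed by full self-attention for one head."""
    return n * n


def long_short_scores(n: int, w: int, r: int) -> int:
    """Upper bound on scores for one head when each of the n queries
    sees at most 2*w local tokens plus r projected tokens."""
    return n * (2 * w + r)


# Hypothetical settings, chosen only for illustration.
W, R = 128, 32

for n in (1024, 2048, 4096):
    full = full_attention_scores(n)
    ls = long_short_scores(n, W, R)
    print(f"n={n}: full={full}, long-short<={ls}, ratio={full / ls:.1f}")
```

Doubling n quadruples the full-attention count but only doubles the long-short bound, which is the sense in which the cost is linear in sequence length for fixed w and r.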

Abstract

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderate-size 55.8M model trained solely on 224x224 ImageNet-1K obtains 84.1% Top-1 accuracy), while being more scalable on high-resolution images. The source code and models are released at https://github.com/NVIDIA/transformer-ls .
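The scale mismatch that the dual normalization strategy addresses can be illustrated with a small sketch. It is a simplified assumption-laden example, not the released implementation: the LayerNorm here omits the learnable gain and bias, the embeddings are random, and the standard deviations 1.0 and 0.1 are hypothetical values chosen only to create a mismatch. All function and variable names are illustrative.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """LayerNorm over one vector, without learnable gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def mean_l2_norm(rows):
    """Average l2 norm of a list of vectors."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in rows) / len(rows)


random.seed(0)
d = 16  # hypothetical hidden dimension

# Hypothetical local-window and projected (global) key embeddings
# drawn at deliberately different scales.
local_keys = [[random.gauss(0, 1.0) for _ in range(d)] for _ in range(8)]
global_keys = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(3)]

ratio_before = mean_l2_norm(local_keys) / mean_l2_norm(global_keys)

# Normalize each set separately, as two independent LayerNorms would.
local_normed = [layer_norm(k) for k in local_keys]
global_normed = [layer_norm(k) for k in global_keys]
ratio_after = mean_l2_norm(local_normed) / mean_l2_norm(global_normed)

print(f"norm ratio before: {ratio_before:.2f}, after: {ratio_after:.3f}")
```

After normalization every vector has an l2 norm close to the square root of d, so the ratio between the two sets is close to 1 regardless of their original scales.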

Paper Structure

This paper contains 30 sections, 7 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Long-short term attention of a single attention head. Here, the sequence length $n=8$, hidden dimension $d=3$, local window segment size $w=2$, and rank of dynamic projection $r=3$. Within the figure, $K(V)$ denotes key $K$ or value $V$. In the left figure, we virtually replicate $K$ or $V \in {\mathbb{R}}^{n\times d}$ into $n$ rows, and highlight the keys and values within the attention span (denoted as $\tilde{K} (\tilde{V})$) of all $n$ queries $Q$ for the short-term attention. In the middle figure, all queries attend to the same projected keys $\bar{K}$ and values $\bar{V}$ within the long-term attention. In the right figure, $\tilde{K}(\tilde{V})$ and $\bar{K}(\bar{V})$ are first normalized with two sets of LayerNorms, and the queries attend to normalized $\tilde{K}(\tilde{V})$ and $\bar{K}(\bar{V})$ within their attention span simultaneously.
  • Figure 2: Left: Ratios of the average $\ell_2$ norms of the local window to global low-rank key/value embeddings at initialization. Without DualLN, the sparse and low-rank embeddings have a magnitude mismatch. With DualLN, the ratios will be $1.0$ at every layer, which will facilitate optimization. Right: The validation loss of Transformer-LS with and without DualLN on enwik8 and text8.
  • Figure 3: Running time and memory consumption of Transformer-XL (full attention) and our Transformer-LS on Char-LM. We increase the sequence length until we use up the 32GB of memory on a V100 GPU. Transformer-LS is the same smaller model as in Table \ref{tbl:charlm}. We use dashed lines to represent the full attention Transformer and solid lines to represent our model. We use different colors to represent different batch sizes.
  • Figure 4: An illustration of effective attention span (colored regions) in Transformer-LS when the segment size for the low-rank attention is $\ell=4$, and the segment size for the sliding window attention is $w=2$. Left: the attention span of only the low-rank attention (segment-wise dynamic projection). Right: the attention span of the aggregated attention.
  • Figure 5: An illustration of our sliding window attention in 1D autoregressive and bidirectional models. Here, we use a group size $w=2$. Each token in a group is restricted to attend to at most $2w$ tokens. In the bidirectional model, each token attends to the $w$ tokens of its home segment, plus $w/2$ tokens to the left and $w/2$ tokens to the right of the home segment. In the autoregressive model, each token attends to the $w$ tokens to the left of its home segment, as well as all tokens within the home segment that are not future tokens.
  • ...and 2 more figures
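
The attention spans described in the Figure 5 caption can be sketched as index ranges. This is an illustrative reading of the caption, not the released implementation: the function name `window_span` is hypothetical, $w$ is assumed to be even in the bidirectional case, and a token is assumed to attend to itself in the autoregressive case.

```python
def window_span(i: int, n: int, w: int, causal: bool) -> list[int]:
    """Indices that token i may attend to in a length-n sequence
    under a segment-wise sliding window with segment size w."""
    start = (i // w) * w        # first index of the home segment
    end = min(start + w, n)     # one past the last index of the home segment
    if causal:
        lo = max(start - w, 0)  # w tokens to the left of the home segment
        hi = i + 1              # home-segment tokens up to and including i
    else:
        half = w // 2
        lo = max(start - half, 0)  # w/2 tokens left of the home segment
        hi = min(end + half, n)    # w/2 tokens right of the home segment
    return list(range(lo, hi))


# Toy setting from the caption: group size w = 2, with n = 8 tokens.
n, w = 8, 2
for i in range(n):
    print(i, window_span(i, n, w, causal=False), window_span(i, n, w, causal=True))
```

With these settings, token 5 attends to indices [3, 4, 5, 6] in the bidirectional case and [2, 3, 4, 5] in the autoregressive case, so both spans stay within the $2w$ limit. Tokens near the sequence boundaries get shorter spans because the range is clipped.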