Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin

TL;DR

This work tackles the quadratic cost of self-attention for long sequences by introducing Fourier Transformer, which down-samples hidden states through a 1D Discrete Cosine Transform implemented via FFT, while preserving compatibility with pretrained weights. By inserting spectral filters between Transformer layers and using transform-truncate-reverse steps, the model achieves substantial speedups and memory savings with minimal performance loss. It reports state-of-the-art results on four of five Long Range Arena tasks and demonstrates strong results when inheriting BART weights for CNN/DailyMail and ELI5, including further improvements with light pretraining. The approach offers a practical, hardware-friendly path to efficient long-range modeling that can leverage existing pretrained models with modest additional pretraining.
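The transform-truncate-reverse step mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`dct_matrix`, `dct_via_fft`, `spectral_downsample`), the orthonormal DCT-II convention, and the `sqrt(m / n)` rescaling that keeps a constant sequence constant are all choices made here for illustration. The forward DCT goes through `np.fft.fft` on a mirrored copy of the sequence; the inverse on the shorter sequence uses an explicit matrix for readability.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (its transpose is its inverse)."""
    k = np.arange(n)[:, None]   # frequency index
    m = np.arange(n)[None, :]   # position index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    c[0] /= np.sqrt(2.0)        # first row gets the 1/sqrt(n) normalisation
    return c

def dct_via_fft(x):
    """Orthonormal DCT-II along axis 0, computed with an FFT of the mirrored input."""
    n = x.shape[0]
    mirrored = np.concatenate([x, x[::-1]], axis=0)   # even extension, length 2n
    spectrum = np.fft.fft(mirrored, axis=0)[:n]
    phase = np.exp(-1j * np.pi * np.arange(n) / (2 * n))
    phase = phase.reshape((n,) + (1,) * (x.ndim - 1))
    coeffs = 0.5 * np.real(phase * spectrum)          # unnormalised DCT-II sums
    coeffs = coeffs * np.sqrt(2.0 / n)
    coeffs[0] /= np.sqrt(2.0)
    return coeffs

def spectral_downsample(hidden, r):
    """Shorten a (seq_len, hidden_dim) array to about r * seq_len positions.

    Transform: DCT along the sequence axis.
    Truncate:  keep the m lowest-frequency coefficients.
    Reverse:   inverse DCT of length m.
    The sqrt(m / n) rescaling is an assumption made for this sketch.
    """
    n = hidden.shape[0]
    m = max(1, int(round(r * n)))
    coeffs = dct_via_fft(hidden)
    kept = coeffs[:m] * np.sqrt(m / n)
    return dct_matrix(m).T @ kept
```

For example, `spectral_downsample(h, 0.5)` on a `(16, 4)` array returns an `(8, 4)` array, and `r = 1.0` returns the input unchanged up to floating-point error.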

Abstract

The transformer model is known to be computationally demanding, and prohibitively costly for long sequences, because the self-attention module has quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation; however, a large portion of these approaches prevent the model from inheriting weights from large pretrained models. In this work, we address the transformer's inefficiency from another perspective. We propose Fourier Transformer, a simple yet effective approach that progressively removes redundancies in the hidden sequence, using the ready-made Fast Fourier Transform (FFT) operator to perform the Discrete Cosine Transform (DCT). Fourier Transformer significantly reduces computational costs while retaining the ability to inherit from various large pretrained models. Experiments show that our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA, with significant improvements in both speed and space. For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART and other efficient models. Our code is publicly available at https://github.com/LUMIA-Group/FourierTransformer
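As a back-of-the-envelope illustration of why shortening the hidden sequence helps (this is not a formula quoted from the paper), suppose a layer sees a sequence of length $N$ with hidden size $d$, and a spectral filter retains a fraction $r \in (0, 1]$ of the positions. The symbols $N$ and $d$ are assumptions introduced here. The cost of the attention score matrix in the following layers then scales as:

$$
\frac{\text{cost after filtering}}{\text{cost before filtering}} \approx \frac{(rN)^2 \, d}{N^2 \, d} = r^2
$$

With $r = 0.5$, for instance, the quadratic attention term drops to roughly a quarter, while the filter itself costs on the order of $N \log N$ per hidden dimension because it is computed with an FFT. Feed-forward layers scale linearly in sequence length, so their cost drops by a factor of about $r$ rather than $r^2$.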

Paper Structure (32 sections, 8 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The power spectrum of input hidden states from different layers in the pretrained RoBERTa (Liu et al., 2019) model. The horizontal axes stand for frequency bins, starting from low frequency components on the left. The vertical axes are the corresponding amplitudes. Amplitudes are averaged over all hidden dimensions and over the entire validation set of Wiki-103 (Merity et al., 2016). Since the inputs are real numbers, the positive and negative frequency components are pairwise conjugate. Thus we only plot the amplitude of the positive half of the frequencies.
  • Figure 2: Overall Model Architecture
  • Figure 3: R1, R2, RL and F1 on ELI5. The x-axis is the retaining ratio $r$.
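
The kind of spectrum described in the Figure 1 caption can be approximated with a real-input FFT along the sequence axis, which returns only the non-negative frequencies (the negative half is the complex conjugate for real inputs). The sketch below is an illustration, not the authors' plotting code. The function name `mean_amplitude_spectrum` and the `(batch, seq_len, hidden_dim)` input layout are assumptions, and synthetic arrays stand in for real RoBERTa hidden states.

```python
import numpy as np

def mean_amplitude_spectrum(hidden_batch):
    """Average amplitude per frequency bin.

    hidden_batch: real array of shape (batch, seq_len, hidden_dim) (assumed layout).
    Returns an array of length seq_len // 2 + 1, low frequencies first.
    """
    spectrum = np.fft.rfft(hidden_batch, axis=1)   # non-negative frequencies only
    amplitude = np.abs(spectrum)                   # square this to get power instead
    return amplitude.mean(axis=(0, 2))             # average over batch and hidden dims

# Synthetic stand-in: a cosine with 2 cycles over 8 positions, copied to every channel.
positions = np.arange(8)
wave = np.cos(2 * np.pi * 2 * positions / 8)
fake_hidden = wave[None, :, None] * np.ones((2, 1, 3))
amps = mean_amplitude_spectrum(fake_hidden)
```

Here `amps` has five entries and peaks at bin 2, the frequency of the synthetic wave. On real hidden states, energy concentrated in the left-most bins is what would suggest that truncating high frequencies loses little information.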