Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling

Daniel Gallo Fernández

TL;DR

Poolformer introduces a pooling-based recurrent architecture for long-sequence modeling, replacing self-attention with RG-LRU-inspired temporal mixing and using SkipBlocks with down/up pooling to dramatically reduce sequence length. The approach yields faster training, better perceptual metrics (FID/IS), and robust generalization on raw audio, while revealing that deep layers capture long-range dependencies and shallow layers focus on short-term features. Empirical results on SC09, Beethoven, and YouTubeMix show competitive log-likelihood with state-of-the-art methods like SaShiMi and Mamba, and superior efficiency and perceptual quality due to pooling. The work outlines clear paths for extending Poolformer to text, vision, and multi-modal settings, potentially enabling Poolformer-based LLMs that process dense representations of images and videos.

Abstract

Sequence-to-sequence models have become central in Artificial Intelligence, particularly following the introduction of the transformer architecture. While initially developed for Natural Language Processing, these models have demonstrated utility across domains, including Computer Vision. Such models require mechanisms to exchange information along the time dimension, typically using recurrent or self-attention layers. However, self-attention scales quadratically with sequence length, limiting its practicality for very long sequences. We introduce Poolformer, a sequence-to-sequence model that replaces self-attention with recurrent layers and incorporates pooling operations to reduce sequence length. Poolformer is defined recursively using SkipBlocks, which contain residual blocks, a down-pooling layer, a nested SkipBlock, an up-pooling layer, and additional residual blocks. We conduct extensive experiments to support our architectural choices. Our results show that pooling greatly accelerates training, improves perceptual metrics (FID and IS), and prevents overfitting. Our experiments also suggest that long-range dependencies are handled by deep layers, while shallow layers take care of short-term features. Evaluated on raw audio, which naturally features long sequence lengths, Poolformer outperforms state-of-the-art models such as SaShiMi and Mamba. Future directions include applications to text and vision, as well as multi-modal scenarios, where a Poolformer-based LLM could effectively process dense representations of images and videos.
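
The recursive SkipBlock definition in the abstract can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the paper's implementation: `residual_block` is a hypothetical stand-in that uses a fixed linear recurrence for temporal mixing, `down_pool` and `up_pool` use plain averaging and repetition where the paper's layers may be learned, and the skip connection around the pooled path is assumed from the block's name. The sketch also ignores the causality constraints an autoregressive model would need.

```python
from typing import List

Seq = List[float]


def residual_block(x: Seq, decay: float = 0.9) -> Seq:
    """Toy residual block (assumption): x + h, where h is a simple linear recurrence."""
    h, out = 0.0, []
    for v in x:
        h = decay * h + (1.0 - decay) * v  # stand-in for recurrent temporal mixing
        out.append(v + h)                  # residual connection
    return out


def down_pool(x: Seq, k: int) -> Seq:
    """Shorten the sequence by a factor k by averaging non-overlapping groups."""
    if len(x) % k != 0:
        raise ValueError("sequence length must be divisible by the pooling factor")
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]


def up_pool(x: Seq, k: int) -> Seq:
    """Restore the original length by repeating every element k times."""
    return [v for v in x for _ in range(k)]


def skip_block(x: Seq, factors: List[int], n_res: int = 1) -> Seq:
    """Residual blocks, down-pool, nested SkipBlock, up-pool, residual blocks."""
    for _ in range(n_res):
        x = residual_block(x)
    if factors:  # an empty list marks the innermost level (assumed base case)
        k = factors[0]
        inner = skip_block(down_pool(x, k), factors[1:], n_res)
        x = [s + u for s, u in zip(x, up_pool(inner, k))]  # assumed skip connection
    for _ in range(n_res):
        x = residual_block(x)
    return x


def pooled_lengths(length: int, factors: List[int]) -> List[int]:
    """Sequence length seen at each nesting depth."""
    lengths = [length]
    for k in factors:
        length //= k
        lengths.append(length)
    return lengths


signal = [float(i) for i in range(16)]
output = skip_block(signal, factors=[4, 2])
print(len(output), pooled_lengths(len(signal), [4, 2]))
```

With pooling factors 4 and 2, the innermost blocks operate on a sequence 8 times shorter than the input, which is where the training-speed benefit described in the abstract would come from.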

Paper Structure

This paper contains 50 sections, 9 theorems, 72 equations, 19 figures, 7 tables.

Key Result

Theorem 1

The inverse discrete Fourier transform can be expressed in terms of the discrete Fourier transform.
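
One standard way to make this statement concrete is the conjugation identity $\mathrm{IDFT}(X) = \frac{1}{N}\,\overline{\mathrm{DFT}(\overline{X})}$. The paper's exact formulation of Theorem 1 is not shown here and may differ (for example, it may use index reversal instead). The sketch below assumes the unnormalized forward DFT with a $1/N$ factor on the inverse, and the function names are illustrative.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Naive O(N^2) forward DFT: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft_via_dft(X: Sequence[complex]) -> List[complex]:
    """Inverse DFT computed with the forward DFT only: conj(DFT(conj(X))) / N."""
    n = len(X)
    conjugated = [complex(c).conjugate() for c in X]
    return [v.conjugate() / n for v in dft(conjugated)]


x = [1.0, 2.0, 0.5, -1.0]
recovered = idft_via_dft(dft(x))
max_error = max(abs(a - b) for a, b in zip(x, recovered))
print(max_error)  # should be at floating-point rounding level
```

In practice this means a single forward transform routine is enough to compute both directions.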

Figures (19)

  • Figure 1: $\mu$-law encoding visualization. The $\mu$-law function (left) compresses large amplitudes and expands small ones. When applied to a signal (right), it produces a "stretched" waveform.
  • Figure 2: Symmetries of the discrete Fourier transform. (Left) Roots of unity in the complex plane. (Right) Structure of the DFT matrix illustrating periodicity and symmetry.
  • Figure 3: Effect of windowing on the STFT. Figure (a) shows a signal with exactly five periods of a sine wave, where the start and end points align smoothly. As a result, the DFT clearly identifies the 5 Hz frequency. In Figure (b), the signal contains five and a half periods, and the endpoints do not match. This causes spectral leakage, spreading power into neighboring frequency bins beyond just 5 and 6 Hz. In Figure (c), the signal is multiplied by a Hann window, which reduces the discontinuities at the boundaries. Consequently, the magnitude spectrum becomes more concentrated around the 5 and 6 Hz components.
  • Figure 4: In (a) we see a waveform that corresponds to a one-second utterance from the SC09 dataset (see Section \ref{sec:datasets}). After applying the short-time Fourier transform (STFT), we get a complex-valued matrix. The phase (b) seems quite random, but the magnitude (c) is very structured. Finally, (d) shows the mel-spectrogram.
  • Figure 5: Associative scan algorithm (taken from associative_scan). The associative scan algorithm computes prefix sums (or prefix results of any associative operation) over a sequence of length $S$ in $O(\log S)$ parallel steps. First, the up-sweep is performed, where consecutive nodes are combined. Then, the root is set to the identity (zero in the case of the sum). For the down-sweep, the value at a parent node is passed directly to its left child. The value passed to the right child is the parent's value combined with the up-sweep result of that child's left sibling.
  • ...and 14 more figures
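
The up-sweep and down-sweep described in the Figure 5 caption can be sketched as follows. This is a sequential Python illustration of the work-efficient exclusive scan, assuming a power-of-two input length. The function name is illustrative and the code is not taken from the paper. Each inner `for` loop touches disjoint positions, so on parallel hardware it would run as a single step, giving the logarithmic depth.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def exclusive_scan(values: Sequence[T], op: Callable[[T, T], T], identity: T) -> List[T]:
    """Exclusive prefix scan via up-sweep and down-sweep (power-of-two length assumed)."""
    a = list(values)
    n = len(a)
    if n == 0:
        return a
    if n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two length")

    # Up-sweep: combine consecutive subtrees, storing each result at the right end.
    step = 2
    while step <= n:
        half = step // 2
        for i in range(0, n, step):
            a[i + step - 1] = op(a[i + half - 1], a[i + step - 1])
        step *= 2

    # Set the root to the identity element.
    a[n - 1] = identity

    # Down-sweep: the left child receives the parent's value; the right child
    # receives the parent's value combined with the left sibling's up-sweep result.
    step = n
    while step >= 2:
        half = step // 2
        for i in range(0, n, step):
            left, right = i + half - 1, i + step - 1
            left_sum = a[left]
            a[left] = a[right]
            a[right] = op(a[right], left_sum)
        step //= 2
    return a


sums = exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3], lambda x, y: x + y, 0)
print(sums)  # [0, 3, 4, 11, 11, 15, 16, 22]
```

Because only associativity is required, the same routine also works for non-commutative operations. The argument order in `op(a[right], left_sum)` keeps earlier elements on the left.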

Theorems & Definitions (36)

  • Definition 1: $\mu$-law encoding
  • Definition 2: $\mu$-law decoding
  • Definition 3: Fourier transform
  • Definition 4: Inverse Fourier transform
  • Definition 5: Discrete Fourier transform
  • Definition 6: Discrete inverse Fourier transform
  • Theorem 1
  • proof
  • Definition 7: Circular Convolution
  • Theorem 2: Convolution Theorem
  • ...and 26 more
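
As a companion to Definitions 1 and 2, here is a minimal sketch of the textbook $\mu$-law companding pair, $F(x) = \operatorname{sign}(x)\,\frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}$ for $x \in [-1, 1]$, and its inverse. The paper's exact definitions are not reproduced here and may add a quantization step. The default $\mu = 255$ is the conventional 8-bit telephony value and is an assumption, as are the function names.

```python
import math


def mu_law_encode(x: float, mu: float = 255.0) -> float:
    """Compress an amplitude in [-1, 1]: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_decode(y: float, mu: float = 255.0) -> float:
    """Invert the encoding: sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


samples = [-1.0, -0.5, -0.01, 0.0, 0.01, 0.5, 1.0]
round_trip = [mu_law_decode(mu_law_encode(s)) for s in samples]
print(max(abs(a - b) for a, b in zip(samples, round_trip)))
```

Small amplitudes are mapped to noticeably larger encoded values while full-scale amplitudes stay at magnitude 1, which matches the "stretched" waveform described for Figure 1.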