Reparameterized Multi-Resolution Convolutions for Long Sequence Modelling

Harry Jake Cunningham, Giorgio Giannone, Mingtian Zhang, Marc Peter Deisenroth

TL;DR

The paper tackles the challenge of modelling extremely long sequences by introducing MRConv, a parameter-efficient, reparameterized multi-resolution convolution framework. MRConv builds long global kernels as learnable sums of low-rank sub-kernels across multiple resolutions, trained in parallel via causal structural reparameterization and merged into a single kernel for inference. It offers three kernel parameterizations (dilated, Fourier, and sparse) and uses FFT-based convolutions to maintain efficiency. On Long Range Arena, sCIFAR, and Speech Commands, MRConv achieves state-of-the-art results among convolutional models and linear-time transformers while improving efficiency. On ImageNet, replacing 2D convolutions with 1D MRConv layers improves classification performance, supporting its applicability across diverse modalities.
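
The FFT-based convolution mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `causal_fft_conv` and `causal_direct_conv`, the single-channel setup, and the random test data are assumptions made for illustration. It shows how zero-padding turns the FFT's circular convolution into a causal one, at $O(L \log L)$ cost rather than $O(L^2)$.

```python
import numpy as np

def causal_fft_conv(u, k):
    """Causal convolution y[t] = sum_{s<=t} k[s] * u[t-s], computed via the FFT.

    u: input sequence, shape (L,)
    k: kernel, shape (K,) with K <= L
    (Hypothetical helper for illustration; single channel only.)
    """
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular convolution cannot wrap around
    U = np.fft.rfft(u, n=n)
    K = np.fft.rfft(k, n=n)
    y = np.fft.irfft(U * K, n=n)
    return y[:L]  # keep only the first L (causal) outputs

def causal_direct_conv(u, k):
    """Reference implementation using a direct (non-FFT) convolution."""
    return np.convolve(u, k)[: u.shape[0]]

rng = np.random.default_rng(0)
u = rng.standard_normal(1024)
k = rng.standard_normal(1024)  # a "global" kernel as long as the input

y_fft = causal_fft_conv(u, k)
y_ref = causal_direct_conv(u, k)
print(np.allclose(y_fft, y_ref))  # the two should agree up to floating-point error
```

Padding to length $2L$ is the simplest choice that guarantees no wrap-around for any kernel of length at most $L$.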

Abstract

Global convolutions have shown increasing promise as powerful general-purpose sequence models. However, training long convolutions is challenging, and kernel parameterizations must be able to learn long-range dependencies without overfitting. This work introduces reparameterized multi-resolution convolutions ($\texttt{MRConv}$), a novel approach to parameterizing global convolutional kernels for long-sequence modelling. By leveraging multi-resolution convolutions, incorporating structural reparameterization and introducing learnable kernel decay, $\texttt{MRConv}$ learns expressive long-range kernels that perform well across various data modalities. Our experiments demonstrate state-of-the-art performance on the Long Range Arena, Sequential CIFAR, and Speech Commands tasks among convolution models and linear-time transformers. Moreover, we report improved performance on ImageNet classification by replacing 2D convolutions with 1D $\texttt{MRConv}$ layers.

Paper Structure

This paper contains 49 sections, 1 theorem, 18 equations, 5 figures, 13 tables, 3 algorithms.

Key Result

Theorem 1

As state size $N\rightarrow \infty$, the SSM in Eq. \ref{eq:ssm_fourier} is a time-invariant orthogonal state space model defined by the truncated Fourier basis functions, orthonormal on $[0,1]$, $\{p_n \}_{n\ge 0} = [1, c_0(t), s_0(t), \cdots]$, where $c_m(t)=\sqrt{2}\cos (2\pi m t)$ and $s_m(t) = \sqrt{2}\sin (2\pi m t)$ ...
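
As a quick check of the orthonormality claim for the cosine functions, the product-to-sum identity can be applied directly. The restriction to modes $m, n \ge 1$ is an assumption of this sketch, since the statement does not give the index range here:

$$\int_0^1 c_m(t)\,c_n(t)\,dt = 2\int_0^1 \cos(2\pi m t)\cos(2\pi n t)\,dt = \int_0^1 \big[\cos(2\pi (m-n) t) + \cos(2\pi (m+n) t)\big]\,dt = \delta_{mn}.$$

The second term always integrates to zero because $m+n \ge 2$, and the first term integrates to one only when $m = n$. Similarly, $\int_0^1 1 \cdot c_m(t)\,dt = 0$ for $m \ge 1$, so each such $c_m$ is orthogonal to the constant function.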

Figures (5)

  • Figure 1: Left: The MRConv block is composed of an MRConv layer, a GELU activation, a pointwise linear layer to mix the channels, and a gated linear unit. Middle: During training, the MRConv layer processes the input using $N$ branches, each with its own convolution kernel of increasing length and its own BatchNorm parameters. The output of the layer is given by pointwise multiplying each branch by $\alpha_i$ and summing. Right: At inference, the branches can be reparameterised into a single convolution.
  • Figure 2: Multi-resolution structural reparameterization. During training, we parameterize each branch with a kernel of increasing length but a fixed number of parameters. For the Fourier kernels, we use only a handful of low-frequency modes, and for the dilated kernels we increase the dilation factor. At inference, we combine the kernels into a single kernel by merging the BN parameters with the kernel parameters and performing a learnt weighted summation.
  • Figure 3: Left: ImageNet Top-1 Acc. vs. Throughput. Right: Distribution of $\bm{\alpha}$ norms for each depth for MRConv trained on ListOps and CIFAR respectively. The changing composition of kernels highlights that the convolution kernels are non-stationary with respect to depth.
  • Figure 4: ImageNet Top-1 Accuracy vs. Throughput. Enlarged version of Figure 3 (left) from the main body of the paper.
  • Figure 5: Visualization of learned kernels from MRConvNeXt at different stages for both Fourier and Fourier + Sparse parameterizations.
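
The training-time branches and inference-time merge described in the captions of Figures 1 and 2 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's code: it uses a single channel, dense random kernels as stand-ins for the dilated or Fourier parameterizations, BatchNorm in inference mode (fixed statistics), and hypothetical helper names (`branch_forward`, `merge_branches`). Because each branch is affine in the input, the BN scale and the weight $\alpha_i$ fold into the kernel, and the BN shift folds into a bias.

```python
import numpy as np

def causal_conv(u, k):
    """Causal convolution: y[t] = sum_s k[s] * u[t - s]."""
    return np.convolve(u, k)[: u.shape[0]]

def branch_forward(u, kernels, bn, alpha, eps=1e-5):
    """Multi-branch forward pass: conv -> BN (fixed statistics) -> scale by alpha -> sum."""
    y = np.zeros_like(u)
    for k, (gamma, beta, mean, var), a in zip(kernels, bn, alpha):
        z = causal_conv(u, k)
        z = gamma * (z - mean) / np.sqrt(var + eps) + beta
        y += a * z
    return y

def merge_branches(kernels, bn, alpha, eps=1e-5):
    """Fold BN parameters and alpha weights into one kernel and one bias."""
    length = max(k.shape[0] for k in kernels)
    merged_k = np.zeros(length)
    merged_b = 0.0
    for k, (gamma, beta, mean, var), a in zip(kernels, bn, alpha):
        scale = gamma / np.sqrt(var + eps)
        merged_k[: k.shape[0]] += a * scale * k  # zero-pad on the right to keep causality
        merged_b += a * (beta - mean * scale)
    return merged_k, merged_b

rng = np.random.default_rng(0)
u = rng.standard_normal(64)
# Assumed stand-in kernels of increasing length (one per branch).
kernels = [rng.standard_normal(n) for n in (4, 8, 16, 32)]
# Assumed BN parameters per branch: (gamma, beta, running mean, running variance).
bn = [(rng.uniform(0.5, 1.5), rng.standard_normal(),
       rng.standard_normal(), rng.uniform(0.5, 1.5)) for _ in kernels]
alpha = rng.standard_normal(len(kernels))

y_branches = branch_forward(u, kernels, bn, alpha)
merged_k, merged_b = merge_branches(kernels, bn, alpha)
y_merged = causal_conv(u, merged_k) + merged_b
print(np.allclose(y_branches, y_merged))  # both paths should match up to rounding
```

The merged path needs only one convolution at inference, regardless of how many branches were used during training.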

Theorems & Definitions (4)

  • Remark
  • Remark
  • Remark
  • Theorem 1 (gu2022train 6)