Dimension Mixer: Group Mixing of Input Dimensions for Efficient Function Approximation

Suman Sapkota; Binod Bhattarai

TL;DR

This work proposes Dimension Mixer, a general, sparse signal-processing framework built from Select and Mix stages to flexibly and efficiently mix input dimensions. By extending FFT-inspired butterfly sparsity to non-linear mixers, it introduces Butterfly MLP and Butterfly Attention, and adds Patch-Only MLP-Mixer for 2D vision signals. Empirical results on CIFAR, Long Range Arena, and Pathfinder-X show that non-linear butterfly mixers achieve competitive accuracy with reduced parameters and compute, particularly excelling in long-range and large-sequence scenarios. Overall, the paper presents a unifying perspective on dimension mixing across CNNs, Transformers, and MLP-Mixers, highlighting scalable, structured approaches for efficient deep learning models.
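The Select and Mix idea can be made concrete with a small connectivity check. The following is a minimal sketch, not the authors' code: the function name `reachability`, the encoding of the Select stage as lists of index lists, and the choice to let unselected dimensions pass through unchanged are all assumptions made for illustration. It tracks only which inputs can influence which outputs, treating each mixer as fully mixing the dimensions it selects.

```python
import numpy as np

def reachability(n, layers):
    """Return a boolean matrix R where R[out, in] is True if input `in`
    can influence output `out` after all layers.

    `layers` is a list of layers; each layer is a list of index lists,
    one index list per mixer unit (a hypothetical encoding of Select).
    """
    reach = np.eye(n, dtype=bool)
    for selections in layers:
        # Assumption: dimensions not selected by any mixer pass through as-is.
        step = np.eye(n, dtype=bool)
        for idx in selections:
            # A mixer is assumed to mix all of its selected dimensions.
            step[np.ix_(idx, idx)] = True
        reach = (step.astype(int) @ reach.astype(int)) > 0
    return reach

# Two layers with different groupings connect every input to every output.
full = reachability(4, [[[0, 1], [2, 3]], [[0, 2], [1, 3]]])
# Repeating the same grouping never mixes across groups.
partial = reachability(4, [[[0, 1], [2, 3]], [[0, 1], [2, 3]]])
print(full.all(), partial.all())
```

A fully `True` matrix corresponds to the condition that there is a path from any input to any output dimension; mixers of unequal size can be expressed by index lists of different lengths.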

Abstract

The recent success of several neural architectures, such as CNNs, Transformers, and MLP-Mixers, motivated us to look for similarities and differences between them. We found that these architectures can be interpreted through the lens of a general concept of dimension mixing. Research on coupling flows and the butterfly transform shows that partial and hierarchical signal mixing schemes are sufficient for efficient and expressive function approximation. In this work, we study group-wise sparse, non-linear, multi-layered, and learnable mixing schemes of inputs and find that they are complementary to many standard neural architectures. Following these observations and drawing inspiration from the Fast Fourier Transform, we generalize the butterfly structure to use a non-linear mixer function; using an MLP as the mixing function yields Butterfly MLP. We also sparsely mix along the sequence dimension in Transformer-based architectures, which we call Butterfly Attention. Experiments on the CIFAR and LRA datasets demonstrate that the proposed Non-Linear Butterfly Mixers are efficient and scale well when the host architectures are used as the mixing function. Additionally, we propose Patch-Only MLP-Mixer for processing spatial 2D signals, demonstrating a different dimension mixing strategy.
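As a rough illustration of a butterfly structure with a non-linear mixer, the sketch below groups dimensions FFT-style and applies a small ReLU MLP to each group. It is not the paper's algorithm: the helper names `butterfly_groups`, `mlp_mix`, and `butterfly_mlp`, the hidden width, and the random (untrained) weights are assumptions chosen only to show the select, mix, and scatter-back pattern.

```python
import numpy as np

def butterfly_groups(n, radix, layer):
    """Index groups for one butterfly layer.
    Members of a group are radix**layer positions apart."""
    stride = radix ** layer
    idx = np.arange(n).reshape(-1, radix, stride)      # (outer, radix, stride)
    return idx.transpose(0, 2, 1).reshape(-1, radix)   # one row per group

def mlp_mix(x_groups, w1, w2):
    """Apply an independent two-layer ReLU MLP to each group.
    x_groups: (groups, radix); w1: (groups, radix, hidden); w2: (groups, hidden, radix)."""
    h = np.maximum(np.einsum("gr,grh->gh", x_groups, w1), 0.0)
    return np.einsum("gh,ghr->gr", h, w2)

def butterfly_mlp(x, radix=2, hidden=4, seed=0):
    """Mix a 1-D float vector whose length is a power of `radix`.
    Weights are random here; in a trained model they would be learned."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    layers = int(round(np.log(n) / np.log(radix)))
    assert radix ** layers == n, "length must be a power of radix"
    for layer in range(layers):
        groups = butterfly_groups(n, radix, layer)
        w1 = rng.normal(size=(len(groups), radix, hidden))
        w2 = rng.normal(size=(len(groups), hidden, radix))
        y = np.empty_like(x)
        y[groups] = mlp_mix(x[groups], w1, w2)  # select, mix, scatter back
        x = y
    return x

print(butterfly_groups(8, 2, 1))
print(butterfly_mlp(np.arange(8, dtype=float)).shape)
```

With radix 2 and 8 dimensions this uses three layers of four 2-input mixers each, rather than one dense 8-input mixer. Indexing with `groups` plays the role of the permute step, and writing back through the same indices plays the role of un-permute.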

Paper Structure (19 sections, 2 equations, 7 figures, 8 tables)

Introduction
Background and Related Works
Dimension Mixer Models
Non-Linear Butterfly Mixer
Butterfly MLP
Butterfly Attention
Patch-Only MLP-Mixer for Vision
Experiments
Butterfly MLP
Butterfly Attention
Long Range Arena
Vision MLP Mixers
Conclusion
Solving Pathfinder-X
Effect of Patches, Block Size and Stride in Butterfly ViT
...and 4 more sections

Figures (7)

Figure 1: An example of a general Dimension Mixer model, split into multiple layers of (i) Select and (ii) Mix stages. The Select stage selects input dimensions for each of the Mixer units. The Mix stage processes inputs and is learned via optimization. This example shows that the mixers can use arbitrary dimensions and have varying capacity. The mixers can themselves be Dimension Mixer models. Dense parameterization is achievable when mixing is performed such that there is a path from any input to any output dimension.
Figure 2: An example of an FFT-style Non-Linear Butterfly Mixer. This example shows the mixing of an 8-dimensional input signal using a Radix-2 Butterfly. The first layer selects the dimensions as they are. Later layers use Permute to bring different dimensions into a block and then un-permute to place the dimensions back in their original locations. For a Radix-4 Butterfly, a mixer block takes 4 dimensions as input and permutes accordingly, as shown in Algorithm \ref{algo:butterfly_mlp}.
Figure 3: (a) An example of a Butterfly Attention pattern on a sequence length of 16 with a Radix-4 butterfly structure. Using Radix-$\sqrt{N}$ creates two sparse attention matrices for complete mixing of signals. (b) (left) Patch-Only MLP-Mixer (ours) compared to (right) patch-wise and channel-wise mixing for a 12x12 image. Our method replaces channel-wise mixing with patch-wise mixing at a different patch size.
Figure 4: Images with 8x8 = 64 tokens using butterfly attention with different block sizes (BS) and strides. Each color represents a different attention block. Partial mixing of signals over multiple layers can produce complete mixing of every token. Here, the mixing combinations $(ab)$, $(cd)$, $(cb)$, and $(cf)$ create complete mixing; however, $(ae)$ and any repeated mixing such as $(aa)$ do not mix every token.
Figure 5: (left) The standard multi-headed self-attention (MHSA). (middle) Token-group parallel attention, with tokens split into multiple parallel but smaller MHSAs. (right) The most reduced form, which uses 1 head per token group; the token dimension is mixed by the next MLP layer. The attention heads in all figures are labeled $h_i$.
...and 2 more figures
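The captions for Figures 3(a) and 4 describe how sparse block attention over several layers can still mix every token. A minimal sketch of that idea follows, assuming a 1-D sequence of 16 tokens and Radix-4 blocks; the function name `butterfly_attention_mask` and the digit-based block assignment are illustrative assumptions, not the paper's implementation. It builds the boolean attention masks and checks which layer combinations reach every token.

```python
import numpy as np

def butterfly_attention_mask(n, radix, layer):
    """Boolean (n, n) mask: token i may attend to token j when they fall in
    the same butterfly block at this layer (block members are radix**layer apart)."""
    stride = radix ** layer
    pos = np.arange(n)
    # Block id: drop the base-`radix` digit at position `layer`, keep the rest.
    block = (pos // (stride * radix)) * stride + pos % stride
    return block[:, None] == block[None, :]

def composes_to_full(masks):
    """True if stacking the masked attention layers lets every token reach every token."""
    reach = np.eye(masks[0].shape[0], dtype=int)
    for m in masks:
        reach = ((m.astype(int) @ reach) > 0).astype(int)
    return bool(reach.all())

n, radix = 16, 4
m0 = butterfly_attention_mask(n, radix, 0)  # contiguous blocks of 4 tokens
m1 = butterfly_attention_mask(n, radix, 1)  # tokens 4 positions apart
print(int(m0.sum()), int(m1.sum()), n * n)
print(composes_to_full([m0, m1]), composes_to_full([m0, m0]))
```

Each mask allows 4 keys per query instead of 16. Two different layers together connect all token pairs, whereas repeating the same mask keeps tokens confined to their original blocks.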
