Understanding MLP-Mixer as a Wide and Sparse MLP

Tomohiro Hayase, Ryo Karakida

TL;DR

Sparseness is revealed to be a key mechanism underlying the MLP-Mixer, and quantitative similarities between the Mixer and unstructured sparse-weight MLPs are demonstrated empirically.

Abstract

The multi-layer perceptron (MLP) is a fundamental component of deep learning, and recent MLP-based architectures, especially the MLP-Mixer, have achieved significant empirical success. Nevertheless, why and how the MLP-Mixer outperforms conventional MLPs remains largely unexplored. In this work, we reveal that sparseness is a key mechanism underlying the MLP-Mixer. First, the Mixer has an effective expression as a wider MLP with Kronecker-product weights, clarifying that it efficiently embodies several sparseness properties explored in deep learning. In the case of linear layers, the effective expression elucidates an implicit sparse regularization caused by the model architecture and a hidden relation to Monarch matrices, which are also known as another form of sparse parameterization. Next, for general cases, we empirically demonstrate quantitative similarities between the Mixer and unstructured sparse-weight MLPs. Following a guiding principle proposed by Golubeva, Neyshabur and Gur-Ari (2021), which fixes the number of connections and increases the width and sparsity, the Mixers can demonstrate improved performance.
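
The width-sparsity trade-off in the last sentence of the abstract can be made concrete with a small sketch. It assumes a single square $m \times m$ weight whose mask keeps a fraction $p$ of the entries, so that the number of connections is $\Omega = p m^2$. The function name `width_for_density`, this exact way of counting connections, and the budget value are illustrative assumptions, not taken from the paper.

```python
import math

def width_for_density(omega: float, p: float) -> float:
    """Width m of a square layer with density p and omega connections.

    Assumes omega = p * m**2 (illustrative counting, one square layer).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.sqrt(omega / p)

omega = 2 ** 20  # fixed connection budget (illustrative value)

# Lower density p under the same budget means a wider layer.
widths = {p: width_for_density(omega, p) for p in (1.0, 1 / 4, 1 / 16, 1 / 64)}
for p, m in widths.items():
    print(f"p = {p:.4f} -> width m = {m:.0f}")
```

Under this assumption the width grows as $1/\sqrt{p}$ when $\Omega$ is held fixed, which matches the scaling stated for the SW-MLP in the caption of Figure 1.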

Paper Structure (60 sections, 4 theorems, 59 equations, 16 figures, 4 tables)

Key Result

Proposition 3.2

The feature matrix of the S-Mixer (align:feature) is a shallow MLP with width $m=SC$ as follows:
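
A numerical sketch of the vectorization idea behind this statement, not the proposition's own formula: it assumes a token-mixing matrix `W` of size $S \times S$, a channel-mixing matrix `V` of size $C \times C$, and column-major vectorization, and checks that mixing an $S \times C$ feature matrix `X` is a linear map on a vector of width $SC$ with Kronecker-product weights. All names and shapes here are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 4, 3                      # tokens, channels (small illustrative sizes)
X = rng.standard_normal((S, C))  # feature matrix
W = rng.standard_normal((S, S))  # token-mixing weight (assumed shape)
V = rng.standard_normal((C, C))  # channel-mixing weight (assumed shape)

def vec(A):
    """Column-major vectorization: stack the columns of A into one vector."""
    return A.reshape(-1, order="F")

# Token mixing W @ X acts on every column of X: block-diagonal I_C (x) W.
token_weight = np.kron(np.eye(C), W)
# Channel mixing X @ V acts on every row of X: V^T (x) I_S.
channel_weight = np.kron(V.T, np.eye(S))

token_ok = np.allclose(vec(W @ X), token_weight @ vec(X))
channel_ok = np.allclose(vec(X @ V), channel_weight @ vec(X))
# Both mixings together: vec(W X V) = (V^T (x) W) vec(X).
both_ok = np.allclose(vec(W @ X @ V), np.kron(V.T, W) @ vec(X))

print(token_weight.shape, token_ok, channel_ok, both_ok)
```

The weights acting on the vectorized features are $SC \times SC$ matrices, which is where the width $m = SC$ comes from in this picture; the element-wise activation is omitted because it commutes with vectorization.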

Figures (16)

  • Figure 1: Schematic diagram of sparsity treated in this work. (a) A masked weight matrix $M \odot A$ in a sparse-weight MLP (SW-MLP). Its width is $O(1/\sqrt{p})$, where $p$ is the ratio of non-zero entries in the mask $M$. (b) A mixing layer in an MLP-Mixer with the vectorization. The weight behaves as a block diagonal matrix. (c) A weight of a random permuted mixer (RP-Mixer), which is introduced in \ref{sec:beyond}. The block diagonal structure is destroyed by random permutation matrices $J_1,J_2$ to achieve similarity to the SW-MLP.
  • Figure 2: (a) Average of diagonal entries of CKA between trained MLP-Mixer ($S=C=64,32$) and MLP with different sparsity, where $p$ is the ratio of non-zero entries in $M$. (b) CKA between MLP-Mixer ($S=C=64$) and MLP with the corresponding $p=1/64$, and (c) CKA between the Mixer and a dense MLP. (d) Test error on MNIST of shallow MLPs with Monarch matrix weights and Kronecker weights. The result is the average of five trials with different random seeds.
  • Figure 3: Theoretical line of $m$ and $p$ ($\Omega=10^8, \gamma=1$).
  • Figure 4: (left) Test error of MLPs with sparse weights and MLP-Mixers with different widths $\gamma m$ under the fixed $\Omega$. We set $\Omega=2^{19}$, $S=C=(\Omega/\gamma)^{1/3}$, and $\gamma=2$. The x-axis represents the effective width $\gamma m$. (right) The blue line indicates the averaged singular values of the weight $M \odot A$ of SW-MLP over five trials with different random seeds. The red line indicates $c_\gamma$, which is the square root of the right edge of the Marchenko-Pastur (MP) law.
  • Figure 5: Test error improved as the effective width increased. This figure presents S-Mixer, RP S-Mixer, and SW-MLP models on CIFAR-10 (left), CIFAR-100 (center), and STL-10 (right). Experiments were conducted using three different random seeds, and the mean test error is depicted. The observed standard deviations were less than 0.026 for CIFAR-10, 0.056 for CIFAR-100, and 0.008 for STL-10.
  • ...and 11 more figures
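
The correspondence between the Mixer and a sparse-weight MLP described in the captions of Figures 1 and 2 can be illustrated by counting non-zero entries. The sketch below assumes the vectorized token-mixing weight is $I_C \otimes W$ with a dense $S \times S$ block $W$. The permutations are drawn uniformly at random purely for illustration, and all variable names are hypothetical rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
S = C = 16                          # small sizes to keep the matrices light
m = S * C                           # effective width

W = rng.standard_normal((S, S))     # dense token-mixing block
block_diag = np.kron(np.eye(C), W)  # (m, m) block-diagonal weight

# Fraction of non-zero entries: C blocks of S*S entries out of (S*C)**2.
p = np.count_nonzero(block_diag) / block_diag.size

# Row and column re-indexing is equivalent to multiplying by permutation
# matrices on the left and right; it scatters the blocks but keeps the
# number of non-zero entries unchanged.
rows = rng.permutation(m)
cols = rng.permutation(m)
permuted = block_diag[rows][:, cols]
p_permuted = np.count_nonzero(permuted) / permuted.size

print(m, p, p_permuted)
```

The density of the block-diagonal weight is $C S^2 / (SC)^2 = 1/C$, so $S=C=64$ gives $p=1/64$, the density paired with that Mixer in Figure 2. The permuted weight has the same density but no block structure, which is the sense in which it resembles an unstructured masked weight.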

Theorems & Definitions (7)

  • Definition 3.1
  • Proposition 3.2: Effective expression of MLP-Mixer as MLP
  • Proposition 3.3
  • Corollary 3.4
  • Definition 5.1: PK layer and PK family
  • Lemma B.1
  • proof