CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing

Jonathan Cui; David A. Araujo; Suman Saha; Md. Faisal Kabir

TL;DR

This work proposes CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for spatial-channel mixing through cross-scale local and global aggregation and achieves competitive results on popular image recognition benchmarks without incurring substantially more compute.

Abstract

Despite their simpler information fusion designs compared with Vision Transformers and Convolutional Neural Networks, Vision MLP architectures have demonstrated strong performance and high data efficiency in recent research. However, existing works such as CycleMLP and Vision Permutator typically model spatial information in equal-size spatial regions and do not consider cross-scale spatial interactions. Further, their token mixers only model 1- or 2-axis correlations, avoiding 3-axis spatial-channel mixing due to its computational demands. We therefore propose CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for spatial-channel mixing through cross-scale local and global aggregation. The proposed methodology achieves competitive results on popular image recognition benchmarks without incurring substantially more compute. Our largest model, CS-Mixer-L, reaches 83.2% top-1 accuracy on ImageNet-1k with 13.7 GFLOPs and 94 M parameters.

Paper Structure (19 sections, 6 equations, 5 figures, 3 tables)

Introduction
Related Works
Vision Transformers.
Vision MLP Architectures.
Methodology
Notation.
Preliminaries
Network Architecture
Cross-Scale Embedding Layer.
Backbone Stages.
Cross-Scale Patch Merging Layer.
Classifier Head.
CS-Mixer Operator
Experiments
ImageNet-1k Classification
...and 4 more sections

Figures (5)

Figure 1: ImageNet-1k top-1 accuracy vs. model size, with resolution $224\times224$ and no extra training data.
Figure 2: The general architecture of Vision Transformers (Dosovitskiy et al., 2021).
Figure 3: An illustration of CS-Mixer's cross-scale patch embedding mechanism.
Figure 4: Model FLOP count vs. number of parameters. CS-Mixers require competitively low compute at the same number of parameters compared with previous models.
Figure 5: A visualization of how the first output neuron connects to inputs from the first channel in the first head, for each of the four stages (columns). The first row shows the first local aggregation (LA) layer in each stage; the second row shows the first global aggregation (GA) layer.
