AS-MLP: An Axial Shifted MLP Architecture for Vision

Dongze Lian, Zehao Yu, Xing Sun, Shenghua Gao

TL;DR

AS-MLP introduces axial shifts to inject locality into a pure-MLP vision backbone, addressing the locality gap in prior MLP-based models. It achieves competitive ImageNet performance (83.3% Top-1 for AS-MLP-B) with a four-stage, Swin-like architecture and transfers to downstream tasks (COCO detection and ADE20K segmentation) at a computational cost comparable to Swin Transformer. Through ablations, the authors show that an appropriate shift size, padding, and parallel connections maximize local feature interaction while maintaining efficiency. The work positions MLP-based architectures as viable alternatives to convolutional and transformer backbones for both classification and downstream vision tasks, with applications to NLP left as future work.

Abstract

This paper proposes an Axial Shifted MLP architecture (AS-MLP). Unlike MLP-Mixer, where the global spatial feature is encoded for information flow through matrix transposition and one token-mixing MLP, we pay more attention to local feature interaction. By axially shifting channels of the feature map, AS-MLP obtains information flow from different axial directions, which captures local dependencies. This operation lets a pure MLP architecture achieve the same local receptive field as a CNN-like architecture. We can also design the receptive field size and dilation of AS-MLP blocks in the same spirit as convolutional neural networks. With the proposed AS-MLP architecture, our model obtains 83.3% Top-1 accuracy with 88M parameters and 15.2 GFLOPs on the ImageNet-1K dataset. This simple yet effective architecture outperforms all MLP-based architectures and achieves competitive performance compared to transformer-based architectures (e.g., Swin Transformer), even with slightly lower FLOPs. In addition, AS-MLP is the first MLP-based architecture to be applied to downstream tasks (e.g., object detection and semantic segmentation), and the experimental results are also impressive: AS-MLP obtains 51.5 mAP on the COCO validation set and 49.5 MS mIoU on the ADE20K dataset, which is competitive with transformer-based architectures. AS-MLP establishes a strong baseline for MLP-based architectures. Code is available at https://github.com/svip-lab/AS-MLP.
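To make the axial shift described in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation (their PyTorch-style code is in Figure 3 and the linked repository). The function names `shift_zero_pad` and `axial_shift`, the even split of channels into groups, the zero padding, and the sign convention of the offsets are all assumptions made for illustration. The idea shown is that channel groups are shifted by different offsets along one spatial axis, so a later per-pixel channel projection mixes values from neighbouring positions.

```python
import numpy as np


def shift_zero_pad(x, offset, axis):
    """Shift x by `offset` positions along `axis`, filling vacated cells with zeros."""
    out = np.zeros_like(x)
    n = x.shape[axis]
    if abs(offset) >= n:
        return out  # everything is shifted out of the feature map
    src = [slice(None)] * x.ndim
    dst = [slice(None)] * x.ndim
    if offset >= 0:
        src[axis] = slice(0, n - offset)
        dst[axis] = slice(offset, n)
    else:
        src[axis] = slice(-offset, n)
        dst[axis] = slice(0, n + offset)
    out[tuple(dst)] = x[tuple(src)]
    return out


def axial_shift(x, shift_size=3, axis=2, dilation=1):
    """Illustrative axial shift on a (C, H, W) feature map.

    Channels are split into `shift_size` groups (an assumed grouping);
    group g is shifted by (g - shift_size // 2) * dilation along `axis`
    (axis=1 is vertical, axis=2 is horizontal).
    """
    groups = np.array_split(np.arange(x.shape[0]), shift_size)
    out = np.zeros_like(x)
    for g, idx in enumerate(groups):
        offset = (g - shift_size // 2) * dilation
        out[idx] = shift_zero_pad(x[idx], offset, axis)
    return out


# A single impulse in the middle column of every channel.
x = np.zeros((3, 1, 5))
x[:, 0, 2] = 1.0

shifted = axial_shift(x, shift_size=3, axis=2)
# Summing over channels stands in for a channel projection with all-ones weights:
# the impulse now reaches three neighbouring columns.
print(shifted.sum(axis=0))
```

In this sketch, `shift_size` plays the role of a kernel width and `dilation` spaces the offsets apart, which mirrors the abstract's remark that receptive field size and dilation can be designed as in convolutional networks.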

Paper Structure

This paper contains 27 sections, 5 equations, 8 figures, 11 tables, 1 algorithm.

Figures (8)

  • Figure 1: A tiny version of the overall Axial Shifted MLP (AS-MLP) architecture.
  • Figure 2: (a) shows the structure of the AS-MLP block; (b) shows the horizontal shift, where the arrows indicate the steps, and the number in each box is the index of the feature.
  • Figure 3: Code of AS-MLP Block in a PyTorch-like style.
  • Figure 4: The visualization of features from Swin Transformer and our AS-MLP.
  • Figure 5: The throughput curve when the batch size is 1, 4, 8, 16, 32, 64, 128, respectively.
  • ...and 3 more figures