CycleMLP: A MLP-like Architecture for Dense Prediction

Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo

TL;DR

CycleMLP introduces CycleFC, a cycle-based fully-connected operator that increases spatial receptive field while preserving input-scale independence and linear complexity. Building a hierarchical, four-stage backbone with three parallel CycleFC branches, CycleMLP delivers competitive accuracy on ImageNet and strong performance on dense tasks like COCO and ADE20K, often surpassing prior MLP-like models and matching Transformer-based backbones. The work demonstrates effective dense-prediction backbones without self-attention, highlighting improved robustness to corruptions and excellent resolution adaptability across varying input sizes. The practical impact lies in providing a lightweight, scalable alternative for dense vision tasks that maintains high accuracy with lower computational cost.

Abstract

This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense predictions. Modern MLP architectures, e.g., MLP-Mixer, ResMLP, and gMLP, are tied to a fixed image size and are thus infeasible for object detection and segmentation. Compared with these approaches, CycleMLP has two advantages. (1) It can cope with various image sizes. (2) It achieves linear computational complexity with respect to image size by using local windows. In contrast, previous MLPs have $O(N^2)$ computations due to fully spatial connections. We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs. We expand the MLP-like models' applicability, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on the ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on the ImageNet-C dataset. Code is available at https://github.com/ShoufaChen/CycleMLP.
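The complexity contrast in the abstract can be illustrated with a rough operation count. The sketch below is not taken from the paper's code: the function names are hypothetical, and it assumes a fully spatial FC mixes all N = H x W tokens with each other per channel, while a channel-wise FC (whose linear cost Cycle FC is said to share) applies one C_in x C_out projection per token.

```python
def spatial_fc_macs(height, width, channels):
    """Multiply-accumulates for a fully spatial FC (assumed model:
    an N x N token-mixing matrix applied to each channel)."""
    n = height * width
    return n * n * channels


def spatial_fc_params(height, width):
    """Weights of the N x N token-mixing matrix; tied to the image size."""
    n = height * width
    return n * n


def channel_fc_macs(height, width, c_in, c_out):
    """Multiply-accumulates for a channel-wise FC (assumed model:
    one c_in x c_out projection applied at each of the N positions)."""
    n = height * width
    return n * c_in * c_out


if __name__ == "__main__":
    # Doubling the side length quadruples N: the spatial FC cost grows
    # with N^2, the channel-wise cost grows with N.
    for side in (14, 28, 56):
        print(
            side,
            spatial_fc_macs(side, side, 64),
            channel_fc_macs(side, side, 64, 64),
        )
```

Under these assumptions, the spatial FC's weight count also changes with the input resolution, which is why such layers cannot be reused directly on the varying image sizes of detection and segmentation.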

Paper Structure

This paper contains 22 sections, 10 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: (a)-(c): Motivation for the Cycle Fully-Connected Layer (Cycle FC) compared with Channel FC and Spatial FC. (a) Channel FC aggregates features along the channel dimension with spatial size '1'. It can handle various input scales but cannot learn spatial context. (b) Spatial FC (MLP-Mixer, ResMLP, gMLP) has a global receptive field in the spatial dimension. However, its parameter size is fixed and its computational complexity is quadratic in image scale. (c) Our proposed Cycle FC has linear complexity, the same as Channel FC, and a larger receptive field than Channel FC. (d)-(f): Three examples of different stepsizes. Orange blocks denote the sampled positions. $\bigstar$ denotes the output position. For simplicity, we omit the batch dimension and set the feature's width to 1 in this example. Several more general cases can be found in the general-case figure in the Appendix. Best viewed in color.
  • Figure 2: ImageNet accuracy vs. model capacity. All models are trained on ImageNet-1K (Deng et al., 2009) without extra data. CycleMLP surpasses existing MLP-like models such as MLP-Mixer, ResMLP, gMLP, S$^2$-MLP (Yu et al., 2021), and ViP (Hou et al., 2021).
  • Figure 3: Resolution adaptability. All models are trained on 224$\times$224 and evaluated on various resolutions without fine-tuning. Left: Absolute top-1 accuracy; Right: Accuracy difference relative to that tested on 224$\times$224. The superiority of CycleMLP's robustness becomes more significant when scale varies to a greater extent.
  • Figure 4: Effective Receptive Field (ERF). We visualize the ERFs of the last stage for both Swin (Liu et al., 2021) and CycleMLP. Best viewed zoomed in.
  • Figure 5: Comparison of MLP blocks in detail.
  • ...and 3 more figures
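
The sampling idea described in the Figure 1 caption can be sketched in NumPy for the simplified case the caption uses (no batch dimension, feature width 1). This is an illustrative sketch, not the authors' implementation: the function name `cycle_fc_1d`, the offset pattern (channel c reads position i + (c mod stepsize) - stepsize // 2), and the zero padding at the borders are all assumptions made here.

```python
import numpy as np


def cycle_fc_1d(x, weight, bias, stepsize):
    """Hypothetical 1-D Cycle FC sketch.

    x:      (H, C_in) feature column (width fixed to 1, no batch dim)
    weight: (C_in, C_out), independent of H
    bias:   (C_out,)
    Returns an (H, C_out) array.
    """
    height, c_in = x.shape
    # Assumed cyclic offsets: each channel looks at a nearby row,
    # and the pattern repeats every `stepsize` channels.
    offsets = np.arange(c_in) % stepsize - stepsize // 2
    rows = np.arange(height)[:, None] + offsets[None, :]  # (H, C_in)
    valid = (rows >= 0) & (rows < height)
    rows = np.clip(rows, 0, height - 1)
    # Gather one value per (output row, channel); out-of-range reads
    # are replaced by zeros (assumed padding scheme).
    sampled = x[rows, np.arange(c_in)[None, :]]
    sampled = np.where(valid, sampled, 0.0)
    # A single channel projection, so cost stays linear in H.
    return sampled @ weight + bias


rng = np.random.default_rng(0)
weight = rng.standard_normal((6, 4))
bias = rng.standard_normal(4)

# The same weights apply to inputs of different heights.
out_small = cycle_fc_1d(rng.standard_normal((5, 6)), weight, bias, stepsize=3)
out_large = cycle_fc_1d(rng.standard_normal((9, 6)), weight, bias, stepsize=3)
```

With `stepsize=1` every offset is zero under this assumed pattern, so the sketch reduces to a plain channel FC, while larger stepsizes let each output position draw on neighbouring rows without adding any weights.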