Caterpillar: A Pure-MLP Architecture with Shifted-Pillars-Concatenation

Jin Sun, Xiaoshuang Shi, Zhiyuan Wang, Kaidi Xu, Heng Tao Shen, Xiaofeng Zhu

TL;DR

This paper proposes a new MLP module, Shifted-Pillars-Concatenation (SPC), which consists of two steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations and concatenation to the maps to aggregate local features.

Abstract

Modeling in computer vision has evolved to MLPs. Vision MLPs naturally lack local modeling capability, and the simplest remedy is to combine them with convolutional layers. Convolution, known for its sliding-window scheme, suffers from the redundancy and limited parallelism that this scheme entails. In this paper, we seek to dispense with the windowing scheme and introduce a more elaborate and parallelizable method to exploit locality. To this end, we propose a new MLP module, Shifted-Pillars-Concatenation (SPC), which consists of two steps: (1) Pillars-Shift, which generates four neighboring maps by shifting the input image along four directions, and (2) Pillars-Concatenation, which applies linear transformations and concatenation to the maps to aggregate local features. The SPC module offers superior local modeling power and performance gains, making it a promising alternative to the convolutional layer. We then build a pure-MLP architecture called Caterpillar by replacing the convolutional layer with the SPC module in sMLPNet, a hybrid model. Extensive experiments show that Caterpillar performs excellently on both small-scale and ImageNet-1k classification benchmarks, and that it scales and transfers well. The code is available at https://github.com/sunjin19126/Caterpillar.

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, and 17 tables.

Figures (6)

  • Figure 1: (a) The convolutional layer sequentially slides a local window across each pillar (token) with a larger receptive field (i.e., the colored border), leading to limited parallelism and redundant representations. (b) The proposed SPC module adopts a window-free strategy. It applies four linear filters that encode the local features of all pillars in parallel from their neighbors in four directions, exploiting locality elaborately and simultaneously.
  • Figure 2: The SPC module consists of two processes: Pillars-Shift (Shift + Pad) and Pillars-Concatenation (Reduce + Concat + Fuse). In Pillars-Shift, the input image is recurrently shifted along four directions to create neighboring maps, while Pad is used to maintain the feature size by padding these maps with pillars of specific values. In Pillars-Concatenation, Reduce is achieved through four $C \times C/4$ linear projections, and Fuse is accomplished through a $C \times C$ linear projection, where $C$ represents the number of input feature channels.
  • Figure 3: The structures of sMLPNet and Caterpillar blocks.
  • Figure 4: Different ways to combine local and global information.
  • Figure A1: The feature maps of six samples with 3 rows and 4 columns. Each row represents a specific local modeling approach: identity (Iden.), convolution (Conv.) and SPC. The columns are the maps in different phases of Caterpillar (CPr.)-T.
  • ...and 1 more figure
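
The two SPC processes described in the Figure 2 caption (Shift + Pad, then Reduce + Concat + Fuse) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pillars_shift`, `pillars_concat`, `spc`) are hypothetical, the shift is assumed to be one pillar per direction, padding is assumed to be zeros (the caption only says "pillars of specific values"), the order of the four directions is arbitrary, and biases, normalization, and activations are omitted.

```python
import numpy as np

def pillars_shift(x, step=1, pad_value=0.0):
    """Shift + Pad: return four copies of x (H, W, C), each shifted by
    `step` pillars along one direction. Vacated positions are filled with
    `pad_value` (assumed zero here), so the feature size is unchanged."""
    h, w, _ = x.shape
    maps = []
    # (dy, dx) offsets: down, up, right, left (ordering is an assumption).
    for dy, dx in ((step, 0), (-step, 0), (0, step), (0, -step)):
        shifted = np.full_like(x, pad_value)
        src_y = slice(max(0, -dy), h - max(0, dy))
        dst_y = slice(max(0, dy), h - max(0, -dy))
        src_x = slice(max(0, -dx), w - max(0, dx))
        dst_x = slice(max(0, dx), w - max(0, -dx))
        shifted[dst_y, dst_x] = x[src_y, src_x]
        maps.append(shifted)
    return maps

def pillars_concat(maps, reduce_weights, fuse_weight):
    """Reduce + Concat + Fuse: project each map from C to C/4 channels,
    concatenate back to C channels, then apply a C x C projection."""
    reduced = [m @ w for m, w in zip(maps, reduce_weights)]  # each (H, W, C/4)
    concat = np.concatenate(reduced, axis=-1)                # (H, W, C)
    return concat @ fuse_weight                              # (H, W, C)

def spc(x, reduce_weights, fuse_weight):
    """Apply Pillars-Shift followed by Pillars-Concatenation."""
    return pillars_concat(pillars_shift(x), reduce_weights, fuse_weight)

# Usage with random weights: H=5, W=6, C=8.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6, 8))
reduce_weights = [rng.standard_normal((8, 2)) for _ in range(4)]
fuse_weight = rng.standard_normal((8, 8))
y = spc(x, reduce_weights, fuse_weight)  # same spatial size and channel count as x
```

Under these assumptions, every pillar is processed by the same matrix products with no sliding window. Ignoring biases, the four $C \times C/4$ Reduce projections and the $C \times C$ Fuse projection together hold $4 \cdot C \cdot C/4 + C^2 = 2C^2$ weights.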