MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Jongseong Bae, Susang Kim, Minsu Cho, Ha Young Kim

TL;DR

MVFormer targets efficiency and feature diversity in vision transformers (ViTs) with two components. Multi-view normalization (MVN) fuses batch, layer, and instance normalization (BN, LN, and IN) via learnable per-channel weights. The multi-view token mixer (MVTM) is a three-scale, stage-aware convolutional token mixer. Integrated into the MetaFormer block, these components yield MVFormer, whose variants (MVFormer-xT, T, S, and B) achieve state-of-the-art performance among convolution-based ViTs on ImageNet-1K and strong results on COCO and ADE20K, with comparable or lower parameter counts and MACs. Ablation studies confirm that the two components are complementary and that stage-specific multiscale mixing and normalization diversity both contribute to the performance gains. The work demonstrates that combining diverse normalization views with multiscale token mixing offers a scalable, efficient path for advancing ViT-based vision systems.
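As an illustration of the normalization fusion summarized above, the following NumPy sketch combines three normalized views of a feature map with per-channel weights. It is not the authors' implementation. The function names, the choice of axes for LN (all channels and spatial positions of a sample), the use of plain batch statistics without running averages or affine parameters, and the uniform weights in the usage lines are all assumptions made for the example.

```python
import numpy as np

EPS = 1e-5  # small constant for numerical stability (assumed value)


def standardize(x, axes):
    """Return x with zero mean and unit variance over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + EPS)


def multi_view_norm(x, alpha_bn, alpha_ln, alpha_in):
    """Weighted sum of three normalized views of x.

    x:        feature map of shape (N, C, H, W)
    alpha_*:  per-channel mixing weights of shape (C,)
    """
    bn = standardize(x, (0, 2, 3))   # BN view: statistics per channel, over batch and space
    ln = standardize(x, (1, 2, 3))   # LN view (assumed axes): statistics per sample
    inn = standardize(x, (2, 3))     # IN view: statistics per sample and channel

    def per_channel(alpha):
        # Reshape (C,) to (1, C, 1, 1) so it broadcasts over N, H, W.
        return alpha.reshape(1, -1, 1, 1)

    return (per_channel(alpha_bn) * bn
            + per_channel(alpha_ln) * ln
            + per_channel(alpha_in) * inn)


# Usage with hypothetical shapes and uniform weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8, 8))
alphas = np.full(6, 1.0 / 3.0)
y = multi_view_norm(x, alphas, alphas, alphas)
print(y.shape)  # (4, 6, 8, 8)
```

In a trainable model the three weight vectors would be learned parameters; here they are fixed arrays so that the arithmetic of the weighted sum stays visible.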

Abstract

Enhancing the efficiency of vision transformers (ViTs) is an active area of research. Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features, obtained via batch, layer, and instance normalization, using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters. It incorporates stage specificity by configuring different receptive fields for the token mixer at each stage, efficiently capturing visual patterns at a range of scales. We propose a novel ViT model, multi-vision transformer (MVFormer), which adopts the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state-of-the-art convolution-based ViTs on image classification, object detection, and instance and semantic segmentation with the same or fewer parameters and MACs. In particular, the MVFormer-T, S, and B variants achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on the ImageNet-1K benchmark.
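The multiscale mixing described in the abstract can be sketched roughly as follows. The abstract does not say how the outputs of the local, intermediate, and global filters are combined, so this example assumes the channels are split into three groups, each group passes through a depthwise convolution with its own kernel size, and the results are concatenated. The function names and the kernel sizes (3, 5, 7) are hypothetical placeholders, not values taken from the paper.

```python
import numpy as np


def depthwise_conv2d(x, kernels):
    """Zero-padded, same-size depthwise cross-correlation.

    x:       feature map of shape (C, H, W)
    kernels: one k x k filter per channel, shape (C, k, k), with k odd
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(x.shape, dtype=float)
    # Accumulate one shifted copy of the input per kernel tap.
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out


def multi_scale_mixer(x, kernel_sets):
    """Apply a different depthwise filter bank to each channel group.

    kernel_sets holds one (C_group, k, k) array per scale, ordered from
    the smallest to the largest receptive field (assumed design).
    """
    groups = np.array_split(x, len(kernel_sets), axis=0)
    mixed = [depthwise_conv2d(g, k) for g, k in zip(groups, kernel_sets)]
    return np.concatenate(mixed, axis=0)


# Usage with hypothetical sizes: 6 channels, 2 per scale.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(6, 16, 16))
kernel_sizes = (3, 5, 7)  # placeholder local / intermediate / global sizes
kernel_sets = [rng.normal(size=(2, k, k)) for k in kernel_sizes]
mixed = multi_scale_mixer(feature_map, kernel_sets)
print(mixed.shape)  # (6, 16, 16)
```

Under this reading, stage specificity amounts to passing a different tuple of kernel sizes at each stage of the network.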

Paper Structure

This paper contains 30 sections, 6 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Visualization of normalized images from BN, LN, IN, and their average. These illustrate that BN and IN maintain the detailed spatial distribution of the input image, whereas LN overly smooths the image. Taking a simple average of the three yields an output that is spatially smoothed yet still includes the local details.
  • Figure 2: Overall architecture of the proposed MVFormer and MVFormer block. Similar to MetaFormer, MVFormer adopts a hierarchical architecture with four stages. Each $\mathrm{Stage}_j$ comprises $N_j$ blocks with a feature dimension $C_j$. The MVFormer block consists of two main components, MVN and MVTM, and is shown alongside the MetaFormer block for comparison.
  • Figure 3: The average values of $\alpha_{LN}, \alpha_{BN}$ and $\alpha_{IN}$ for each block in MVFormer-S.
  • Figure 4: Comparison of training loss curves of different normalization methods, BN, LN, IN, and MVN, over 300 epochs.
  • Figure 5: Activation maps generated by Grad-CAM for the ConvFormer-S18 and MVFormer-T models.
  • ...and 1 more figures