MVFormer: Diversifying Feature Normalization and Token Mixing for Efficient Vision Transformers

Jongseong Bae; Susang Kim; Minsu Cho; Ha Young Kim

TL;DR

MVFormer improves the efficiency of vision transformers (ViTs) by diversifying the features they learn. It introduces two components: multi-view normalization (MVN), which fuses batch, layer, and instance normalization (BN, LN, and IN) through learnable per-channel weights, and the multi-view token mixer (MVTM), a convolutional token mixer that operates at three scales and adapts its receptive fields to each stage. Integrated into the MetaFormer block, these components yield MVFormer, whose variants (MVFormer-xT, T, S, and B) achieve state-of-the-art performance among convolution-based ViTs on ImageNet-1K and strong results on COCO and ADE20K, all with comparable or lower parameter counts and MACs. Ablation studies confirm that the two components are complementary and that both stage-specific multiscale mixing and normalization diversity contribute to the gains. The work demonstrates that combining diverse normalization views with multiscale token mixing offers a scalable, efficient path for advancing ViT-based vision systems.

Abstract

Improving the efficiency of vision transformers (ViTs) is an active area of research. Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi-view normalization (MVN) and a token mixer called multi-view token mixer (MVTM). The MVN integrates three differently normalized features, obtained via batch, layer, and instance normalization, using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution-based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring different receptive fields for the token mixer at each stage, efficiently capturing a range of visual patterns. We propose a novel ViT model, multi-vision transformer (MVFormer), adopting the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state-of-the-art convolution-based ViTs on image classification, object detection, and instance and semantic segmentation with the same or fewer parameters and MACs. In particular, the MVFormer-T, S, and B variants achieve 83.4%, 84.3%, and 84.6% top-1 accuracy, respectively, on the ImageNet-1K benchmark.
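To make the MVN description concrete, the following is a minimal NumPy sketch of a learnable weighted sum over batch-, layer-, and instance-normalized features. It is an illustration under stated assumptions, not the authors' implementation: the function name `multi_view_norm`, the `(3, C)` weight layout, the choice of normalizing LN over the channel and spatial axes of each sample, and the omission of affine parameters and running statistics are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def multi_view_norm(x, weights, eps=1e-5):
    """Weighted sum of three normalized views of x.

    x:       feature map of shape (N, C, H, W).
    weights: array of shape (3, C); rows weight the BN, LN, and IN views
             per channel (hypothetical layout, learnable in a real model).
    """

    def normalize(axes):
        # Standardize x using statistics computed over the given axes.
        mean = x.mean(axis=axes, keepdims=True)
        var = x.var(axis=axes, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    bn_view = normalize((0, 2, 3))  # batch norm: per channel, across the batch
    ln_view = normalize((1, 2, 3))  # layer norm (assumed over C, H, W per sample)
    in_view = normalize((2, 3))     # instance norm: per sample and per channel

    # Broadcast the per-channel weights over batch and spatial dimensions.
    w = np.asarray(weights).reshape(3, 1, -1, 1, 1)
    return w[0] * bn_view + w[1] * ln_view + w[2] * in_view


# Usage: equal weighting of the three views for a random feature map.
features = np.random.default_rng(0).normal(size=(4, 3, 5, 5))
mixed = multi_view_norm(features, np.full((3, 3), 1.0 / 3.0))
```

Because each view is standardized over a different set of axes, the three terms generally differ, which is the source of the feature diversity the abstract describes. Setting one row of `weights` to ones and the others to zeros recovers the corresponding single normalization.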
