EMOv2: Pushing 5M Vision Model Frontier

Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao

TL;DR

EMOv2 rethinks lightweight vision backbones by unifying inverted residual CNN blocks with Transformer-style attention through the Meta Mobile Block and the novel i$^2$RMB module. The spanning SEW-MHSA mechanism models neighbor and distant interactions without adding parameters, enabling a 4-stage backbone built entirely from i$^2$RMB blocks that excels on ImageNet classification and downstream dense prediction tasks at a fixed 5M parameter budget. The approach yields consistent gains across object detection, semantic segmentation, video classification, UNet-style segmentation, and diffusion-based image generation, and shows clear scalability to larger magnitudes with strong training signals. The work delivers a practical, hardware-friendly pathway to high-performance, parameter-efficient vision models and provides open-source code for community adoption and extension.

Abstract

This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up a new frontier for 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized in attention-based design. Our work rethinks the lightweight infrastructure of the efficient IRB and practical components in Transformers from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i$^2$RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth while ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1, significantly surpassing CNN- and attention-based models of the same order. At the same time, RetinaNet equipped with EMOv2-5M achieves 41.5 mAP for object detection, surpassing the previous EMO-5M by +2.6. When employing a more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M-magnitude models to a new level. Code is available at https://github.com/zhangzjn/EMOv2.
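
The abstract motivates the 5M budget by download latency on mobile networks. As a rough illustration only, the arithmetic can be sketched as below. The nominal parameter counts (exactly 1M/2M/5M), the dense fp32/fp16 storage formats, and the 20 Mbps bandwidth figure are all assumptions made for this example and are not taken from the paper; `model_size_mb` and `download_seconds` are hypothetical helper names.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Size of densely stored weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


def download_seconds(size_mb, bandwidth_mbps):
    """Transfer time in seconds at a sustained bandwidth given in megabits/s."""
    return size_mb * 8 / bandwidth_mbps


# Assumed bandwidth for illustration only; real 4G/5G throughput varies widely.
ASSUMED_BANDWIDTH_MBPS = 20

# Nominal parameter counts (assumption: the real models are not exactly round).
for name, params in [("1M", 1_000_000), ("2M", 2_000_000), ("5M", 5_000_000)]:
    fp32_mb = model_size_mb(params, bytes_per_param=4)
    fp16_mb = model_size_mb(params, bytes_per_param=2)
    seconds = download_seconds(fp32_mb, ASSUMED_BANDWIDTH_MBPS)
    print(f"{name}: fp32 {fp32_mb:.1f} MB, fp16 {fp16_mb:.1f} MB, "
          f"about {seconds:.1f} s at {ASSUMED_BANDWIDTH_MBPS} Mbps (fp32)")
```

Under these assumptions the cost grows linearly with parameter count, so the download time is tied directly to the parameter budget.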

Paper Structure

This paper contains 21 sections, 5 equations, 8 figures, 24 tables.

Figures (8)

  • Figure 1: Top: Performance vs. Parameters with concurrent methods. Our EMOv2 achieves significant accuracy with fewer parameters. Superscript $\ast$: The comparison methods employ more robust training strategies described in their papers, while ours uses the strategy mentioned in \ref{table:ablation_all}(e). Bottom: The range of token interactions varies with different window attention mechanisms. Our EMOv2, with parameter-shared spanning attention in \ref{section:iirmb}, has a larger and correspondingly stronger Effective Receptive Field (ERF).
  • Figure 2: Left: Abstracted unified Meta-Mobile Block from Multi-Head Self-Attention, Feed-Forward Network \cite{transformer}, and Inverted Residual Block \cite{mnetv2} (c.f. Sec \ref{section:mmb}). The inductive block can be deduced into specific modules using different expansion ratio $\lambda$ and efficient operator $\mathcal{F}$. Middle: We construct a family of vision models based on our i$^2$RMB module: 4-stage EMOv2, composed solely of the deduced i$^2$RMB (c.f. Sec \ref{section:irmb}), for various perception tasks (image classification, detection, and segmentation in \ref{section:exp_downstream}). Additionally, we introduce the temporally extended V-EMO for video classification, the U-EMO based on an encoder-decoder architecture, and D-EMO to replace the Transformer block in DiT \cite{dit}. These downstream models are typically built based on the i$^2$RMB. Right: Performance comparison with different SoTAs on various tasks.
  • Figure 3: Meta-paradigm comparison between our MMBlock and MetaFormer \cite{metaformer}. We integrate $\bm{\mathcal{F}}$ into the expanded FFN to construct a more streamlined and shallower single-module block.
  • Figure 4: Detailed implementation comparison of the Inverted Residual Mobile Block (iRMB in \ref{section:irmb}) and the improved version (i$^2$RMB in \ref{section:iirmb}). i$^2$RMB designs a parameter-sharing spanning window attention mechanism that simultaneously models the interaction of distant and close window information.
  • Figure 5: Downstream gains of EMOv2-5M over EMOv1-5M.
  • ...and 3 more figures
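
The Figure 2 and Figure 3 captions describe the Meta-Mobile Block as a single-residual module parameterized by an expansion ratio $\lambda$ and an efficient operator $\mathcal{F}$ placed inside the expanded features. A minimal NumPy sketch of that idea follows, under stated assumptions: features are a flat `(tokens, C)` matrix, the expansion and projection are plain matrix multiplies standing in for 1x1 convolutions, and normalization and activation placement are omitted. The function names and the three example operators are hypothetical illustrations, not the authors' implementation, and the exact $\lambda$/$\mathcal{F}$ pairings used in the paper are not reproduced here.

```python
import numpy as np


def meta_mobile_block(x, w_expand, w_shrink, operator):
    """One-residual block: expand -> operator F -> shrink -> add the input.

    x:        (tokens, C) feature matrix
    w_expand: (C, lam * C) weights standing in for a 1x1 expansion
    w_shrink: (lam * C, C) weights standing in for a 1x1 projection
    operator: callable F applied to the expanded (tokens, lam * C) features
    """
    hidden = x @ w_expand          # expand channels by the ratio lam
    hidden = operator(hidden)      # efficient operator F on expanded features
    return x + hidden @ w_shrink   # project back and add the single residual


def identity(h):
    # Assumption: with no mixing operator the block reduces to a linear
    # expand/shrink pair (a real FFN would also apply a nonlinearity).
    return h


def relu(h):
    # A pointwise nonlinearity, used here only as a simple stand-in for F.
    return np.maximum(h, 0.0)


def token_mixing_attention(h):
    # Parameter-free softmax attention across tokens (illustrative only).
    scores = h @ h.T / np.sqrt(h.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ h


rng = np.random.default_rng(0)
tokens, channels, lam = 4, 8, 2
x = rng.standard_normal((tokens, channels))
w_expand = 0.1 * rng.standard_normal((channels, lam * channels))
w_shrink = 0.1 * rng.standard_normal((lam * channels, channels))

for op in (identity, relu, token_mixing_attention):
    y = meta_mobile_block(x, w_expand, w_shrink, op)
    print(op.__name__, y.shape)
```

Swapping `operator` while keeping the expand, shrink, and residual steps fixed is the point of the abstraction: the surrounding structure stays the same and only $\lambda$ and $\mathcal{F}$ change.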