Mobile-Former: Bridging MobileNet and Transformer

Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu

TL;DR

Mobile-Former introduces a parallel MobileNet–Transformer architecture with a lightweight two-way bridge that enables bidirectional fusion of local and global features using very few global tokens. This design achieves superior accuracy under low FLOP budgets on ImageNet and delivers strong object-detection performance, including a faster, end-to-end detector that outperforms DETR with substantially fewer FLOPs and parameters. Across a range of FLOPs (26M–508M), Mobile-Former consistently surpasses efficient CNNs and ViT variants in the low-cost regime. The work demonstrates how a compact transformer and efficient cross-attention bridge can complement MobileNet’s local processing, offering a new design paradigm for efficient vision models and detectors.

Abstract

We present Mobile-Former, a parallel design of MobileNet and transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and of the transformer at global interaction, and the bridge enables bidirectional fusion of local and global features. Unlike recent work on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g. 6 or fewer) that are randomly initialized to learn global priors, resulting in low computational cost. Combined with the proposed light-weight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has more representation power. It outperforms MobileNetV3 in the low-FLOP regime, from 25M to 500M FLOPs, on ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of the computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP in the RetinaNet framework. Furthermore, we build an efficient end-to-end detector by replacing the backbone, encoder and decoder in DETR with Mobile-Former, which outperforms DETR by 1.1 AP while saving 52% of the computational cost and 36% of the parameters.
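To make the abstract's cost argument concrete, the following back-of-the-envelope sketch counts the multiply-adds of the two token-by-token matrix products in self-attention, which grow quadratically with the number of tokens. The token width of 192 and the 196-token baseline (a 14×14 patch grid, as a ViT-style model would produce from a 224×224 image with 16×16 patches) are illustrative assumptions, not figures from the paper, and the helper `attention_mul_adds` is a hypothetical name.

```python
def attention_mul_adds(num_tokens: int, dim: int) -> int:
    """Rough multiply-add count for one single-head self-attention layer.

    Counts only the two quadratic terms:
      - attention scores Q @ K^T:  num_tokens * num_tokens * dim
      - weighted sum   attn @ V:   num_tokens * num_tokens * dim
    Linear projections and the FFN are deliberately ignored.
    """
    return 2 * num_tokens * num_tokens * dim


DIM = 192  # assumed token width, for illustration only

few_tokens = attention_mul_adds(6, DIM)      # 6 global tokens, as in the abstract
patch_tokens = attention_mul_adds(196, DIM)  # assumed 14x14 patch grid

print(few_tokens)                         # 13824
print(patch_tokens)                       # 14751744
print(round(patch_tokens / few_tokens))   # 1067
```

Under these assumptions the quadratic attention terms are roughly three orders of magnitude cheaper with 6 tokens than with a patch-based token set, which is consistent with the abstract's claim that so few tokens keep the transformer's cost low.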

Paper Structure

This paper contains 16 sections, 4 equations, 9 figures, 15 tables.

Figures (9)

  • Figure 1: Overview of Mobile-Former, which runs MobileNet [sandler2018mobilenetv2] on the left side in parallel with a transformer [NIPS2017_transformer] on the right side. Unlike the vision transformer [dosovitskiy2021vit], which forms tokens from image patches, the transformer in Mobile-Former takes as input very few learnable tokens that are randomly initialized. Mobile (MobileNet) and Former (transformer) communicate through a bidirectional bridge, which is modeled by the proposed light-weight cross attention. Best viewed in color.
  • Figure 2: Comparison among Mobile-Former, efficient CNNs and vision transformers, in terms of accuracy over FLOPs. The comparison is performed on ImageNet classification. Mobile-Former consistently outperforms both efficient CNNs and vision transformers in the low-FLOP regime (from 25M to 500M MAdds). Note that we implement Swin [liu2021Swin] and DeiT [touvron2020deit] at low computational budgets, from 100M to 2G FLOPs. Best viewed in color.
  • Figure 3: Mobile-Former block, which includes four modules. The Mobile sub-block modifies the inverted bottleneck block of [sandler2018mobilenetv2] by replacing ReLU with dynamic ReLU [Chen2020DynamicReLU]. Mobile$\rightarrow$Former uses light-weight cross attention to fuse local features into global features. The Former sub-block is a standard transformer block with multi-head attention and an FFN. Note that the output of Former is used to generate the parameters for dynamic ReLU in the Mobile sub-block. Mobile$\leftarrow$Former bridges from global to local features.
  • Figure 4: Mobile-Former for object detection. Both the backbone and the head use Mobile-Former blocks (see Figures 1 and 3). The backbone has 6 global tokens while the head has 100 object queries. All object queries pass through multiple resolutions ($\frac{1}{32}$, $\frac{1}{16}$, $\frac{1}{8}$) in the head. Similar to DETR [nicolas2020detr], a feed-forward network (FFN) is used to predict the class label and bounding box. Best viewed in color.
  • Figure 5: Inference latency over different image sizes. The latency is measured on an Intel(R) Xeon(R) CPU E5-2650 v3 (2.3GHz), following the common settings (single-thread with batch size 1) in [sandler2018mobilenetv2, Howard_2019_ICCV_mbnetv3]. Mobile-Former-214M is compared with MobileNetV3 Large [Howard_2019_ICCV_mbnetv3] as they have similar FLOPs (214M vs. 217M). Mobile-Former is slower when the image size is small, but has faster inference than MobileNetV3 as the image size grows above 750$\times$750. Best viewed in color.
  • ...and 4 more figures
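
The Mobile$\rightarrow$Former step described in the Figure 3 caption (global tokens gathering information from local features through cross attention) can be sketched roughly as follows. This is a minimal single-head illustration in plain Python, not the authors' implementation. It assumes that only the token side carries a learned query projection (`w_q`) and an output projection (`w_o`), while keys and values are the raw local features; that choice, along with all names, shapes and the random weights, is an assumption made here for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, vec):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def mobile_to_former(tokens, local_feats, w_q, w_o):
    """Fuse local features into global tokens with single-head cross attention.

    tokens:      M x d  global tokens
    local_feats: N x c  flattened feature-map positions
    w_q:         c x d  projects a token into the local channel space (query)
    w_o:         d x c  projects the pooled local feature back to token width
    Keys and values are the unprojected local features (an assumption).
    """
    channels = len(local_feats[0])
    updated = []
    for token in tokens:
        query = matvec(w_q, token)
        # Scaled dot-product score between this token and every position.
        scores = [
            sum(q * x for q, x in zip(query, feat)) / math.sqrt(channels)
            for feat in local_feats
        ]
        attn = softmax(scores)
        # Attention-weighted average of the local features.
        pooled = [
            sum(a * feat[j] for a, feat in zip(attn, local_feats))
            for j in range(channels)
        ]
        delta = matvec(w_o, pooled)
        # Residual update of the global token.
        updated.append([t + d for t, d in zip(token, delta)])
    return updated


random.seed(0)
M, D, N, C = 6, 8, 16, 4  # 6 tokens as in the paper; other sizes are arbitrary
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
local_feats = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
w_q = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(C)]
w_o = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(D)]

out = mobile_to_former(tokens, local_feats, w_q, w_o)
print(len(out), len(out[0]))  # 6 8
```

Because there are only a handful of tokens, the attention map here has M × N entries rather than N × N, so the cost of the bridge grows linearly with the number of feature-map positions.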