MobileNetV4 -- Universal Models for the Mobile Ecosystem

Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard

TL;DR

MobileNetV4 introduces the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant.

Abstract

We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe is also introduced which improves MNv4 search effectiveness. The integration of UIB, Mobile MQA and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto optimal across mobile CPUs, DSPs, GPUs, as well as specialized accelerators like Apple Neural Engine and Google Pixel EdgeTPU - a characteristic not found in any other models tested. Finally, to further boost accuracy, we introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8ms.

Paper Structure (21 sections, 2 equations, 11 figures, 17 tables)

Figures (11)

  • Figure 1: MNv4 Models are Universally Mostly Pareto Optimal: MNv4 performs strongly compared to leading efficient models across diverse hardware. All models were trained solely on ImageNet-1k. MobileNetV1-V3 were retrained with updated recipes. Most models were optimized for one device, but MNv4 is Pareto optimal across most devices. Hybrid models and ConvNext are DSP-incompatible. Due to PyTorch-to-TFLite export tool limitations, EfficientViTs [mit-efficientvit, msr-efficientvit] are not benchmarked on CPUs and EdgeTPU. MNv4-Hybrid models were excluded from CoreML evaluation due to the lack of a PyTorch implementation of Mobile MQA.
  • Figure 2: Ridge Points and Latency/Accuracy Trade-Offs: In the roofline performance model, the ridge point summarizes the relationship between memory bandwidth and MACs. If memory bandwidth is constant, high-compute hardware (accelerators) has a higher ridge point than low-compute hardware (CPUs). MobileNetV4 is mostly Pareto-optimal from a ridge point of 0 to 500 MACs/byte. These analytically derived charts (from the roofline model equation) reflect the real hardware measurements in Figure 1. The universality appendix contains further analysis of this relationship.
  • Figure 3: Op Cost vs. Ridge Point: Each sub-chart displays the roofline latency (from the roofline model equation) of a network's ops. Networks start on the left. Large Conv2Ds are expensive on low ridge point (RP) hardware (top row), but add cheap model capacity on high-RP hardware (bottom row). FC layers and DW-Conv2Ds are cheap at low RPs and expensive at high RPs. MobileNetV4 balances MAC-intensive Conv2D layers and memory-intensive FC layers where they contribute most to the network: the beginning and end, respectively. Full sweeps and data for all MobileNetV4-Conv models are in the universality appendix.
  • Figure 4: Universal Inverted Bottleneck (UIB) blocks.
  • Figure 5: MNv4 Models are Universally Mostly Pareto Optimal: This is the same chart as Figure 1, but expanded to be easier to read.
  • ...and 6 more figures
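
The captions for Figures 2 and 3 rely on the roofline model without showing it. The sketch below is one minimal way to compute a roofline latency and a ridge point, assuming the common form in which each op costs the larger of its compute time and its memory time. The `Op` class, the function names, and all numeric values are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One network op (hypothetical container, not from the paper)."""
    name: str
    macs: float         # multiply-accumulate operations
    bytes_moved: float  # weight + activation bytes read or written


def ridge_point(peak_macs_per_s: float, peak_bytes_per_s: float) -> float:
    """MACs per byte at which compute time equals memory time."""
    return peak_macs_per_s / peak_bytes_per_s


def op_latency(op: Op, peak_macs_per_s: float, peak_bytes_per_s: float) -> float:
    """Roofline cost of one op: the slower of compute and memory transfer."""
    compute_time = op.macs / peak_macs_per_s
    memory_time = op.bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)


def roofline_latency(ops, peak_macs_per_s: float, peak_bytes_per_s: float) -> float:
    """Assumed model: total latency is the sum of per-op roofline costs."""
    return sum(op_latency(op, peak_macs_per_s, peak_bytes_per_s) for op in ops)


def bound_by(op: Op, peak_macs_per_s: float, peak_bytes_per_s: float) -> str:
    """An op is compute-bound when its MACs/byte exceeds the ridge point."""
    intensity = op.macs / op.bytes_moved
    rp = ridge_point(peak_macs_per_s, peak_bytes_per_s)
    return "compute" if intensity > rp else "memory"


# Illustrative ops with made-up sizes: a MAC-heavy conv and a memory-heavy FC.
conv = Op("large_conv2d", macs=1e8, bytes_moved=1e6)  # 100 MACs/byte
fc = Op("fc", macs=1e6, bytes_moved=1e6)              # 1 MAC/byte
ops = [conv, fc]

BANDWIDTH = 1e9   # bytes/s, held constant across both devices
LOW_RP = 1e10     # MACs/s, ridge point 10 (CPU-like)
HIGH_RP = 5e11    # MACs/s, ridge point 500 (accelerator-like)

for peak in (LOW_RP, HIGH_RP):
    rp = ridge_point(peak, BANDWIDTH)
    total = roofline_latency(ops, peak, BANDWIDTH)
    print(f"ridge point {rp:g}: latency {total:.4g}s, conv is {bound_by(conv, peak, BANDWIDTH)}-bound")
```

With these made-up numbers, the conv is compute-bound on the low ridge point device and becomes memory-bound on the high ridge point device, while the FC layer is memory-bound on both. This matches the qualitative trend described in the Figure 3 caption.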