Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Abstract

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (multiply-accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that further improves upon the baseline's speed on edge and desktop GPUs. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

Paper Structure

This paper contains 31 sections, 8 figures, and 17 tables.

Figures (8)

  • Figure 1: Comparison of hardware efficiency of different vision backbone architecture families on the Nvidia Jetson TX2. Models in the top-left offer the best hardware efficiency on the Jetson TX2. Both axes are in logarithmic scale. LowFormer base models (B0-B3) outperform all architectures in hardware efficiency, with edge variants (E1-E3) further enhancing efficiency
  • Figure 2: Comparison of Nvidia Jetson TX2 latency and top-1 accuracy for state-of-the-art vision backbones with LowFormer. Models in the top-left offer the best speed-accuracy trade-off. LowFormer consistently achieves lower latency for similar accuracy. Its edge variants (E1/E2/E3) further enhance this trade-off over the base models (B0-B3)
  • Figure 3: Panels a) and c) show the average execution time (a) and latency (c) of the fused mobile inverted bottleneck (MBConv) relative to the unfused one, while panels b) and d) show the relative MAC count. In a) and c), blue marks configurations (channel count and resolution) where the fused MBConv is faster, red the opposite; in b) and d), red marks configurations where the fused MBConv requires more MACs. Bold numbers highlight entries with a particularly unequal ratio. Although the fused MBConv always incurs more MACs, it is faster in many configurations
  • Figure 4: Structure of the fused and unfused MBConv block. $C$ refers to the channel dimension. Both have an expansion factor of 4
  • Figure 5: Lowtention block design. LN refers to layer normalization. In contrast to the traditional MHSA, we encapsulate the SDA with two depthwise convolutions (the second is a transposed depthwise convolution). The projections for MHSA are realized with pointwise convolutions. Here, $DW\downarrow_n$ denotes that the resolution is downscaled by factor $n$, and $DW\uparrow_n$ that it is upscaled by $n$
  • ...and 3 more figures
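The fused-vs-unfused MBConv observation in Figures 3 and 4 can be made concrete with a back-of-the-envelope MAC count. The sketch below is an illustration, not the paper's measurement code: the function names are hypothetical, and it assumes the standard block structure (unfused: 1×1 expand → 3×3 depthwise → 1×1 project; fused: a single full 3×3 convolution replacing the expand + depthwise pair) with the expansion factor of 4 stated in the Figure 4 caption.

```python
def unfused_mbconv_macs(c: int, h: int, w: int, exp: int = 4, k: int = 3) -> int:
    """MACs for an unfused MBConv: 1x1 expand -> kxk depthwise -> 1x1 project."""
    expand = h * w * c * (exp * c)           # pointwise expansion: C -> 4C
    depthwise = h * w * (exp * c) * k * k    # depthwise kxk on 4C channels
    project = h * w * (exp * c) * c          # pointwise projection: 4C -> C
    return expand + depthwise + project

def fused_mbconv_macs(c: int, h: int, w: int, exp: int = 4, k: int = 3) -> int:
    """MACs for a fused MBConv: full kxk conv (C -> 4C) -> 1x1 project."""
    fused = h * w * c * (exp * c) * k * k    # full kxk convolution with expansion
    project = h * w * (exp * c) * c          # pointwise projection: 4C -> C
    return fused + project

# Example configuration: 64 channels at 28x28 resolution.
c, h, w = 64, 28, 28
ratio = fused_mbconv_macs(c, h, w) / unfused_mbconv_macs(c, h, w)
print(f"fused/unfused MAC ratio: {ratio:.2f}")
```

At any realistic channel count the fused variant has several times more MACs (the full convolution dominates), which is exactly why Figure 3 is notable: despite this MAC disadvantage, the fused block is faster for many configurations, since a single dense convolution maps to hardware better than a depthwise one.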