LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

TL;DR

This work analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications.

Abstract

Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and use MACs (multiply-accumulate operations) as an efficiency metric. MACs, however, often do not accurately measure how fast a model actually is, due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but in terms of actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions drawn from that analysis to create a recipe for increasing hardware efficiency in macro design. Additionally, we introduce a simple slimmed-down version of Multi-Head Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. To demonstrate the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks of object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at https://github.com/altair199797/LowFormer.
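
The distinction the abstract draws between MACs and measured speed can be made concrete with a small timing harness. The following is a minimal sketch, not the authors' benchmarking code: it assumes latency is taken as the wall-clock time of a single-input call and throughput as images per second under batched input. The names `measure_latency_ms`, `measure_throughput` and `toy_model`, as well as the batch size, are hypothetical and chosen only for illustration.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=5, repeats=20):
    """Median wall-clock time of one call, in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def measure_throughput(run_batch, batch_size, warmup=2, repeats=10):
    """Images processed per second when inputs arrive in batches."""
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(repeats):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * repeats / elapsed


def toy_model(batch_size):
    """Hypothetical stand-in for a backbone forward pass (pure Python work)."""
    return sum(i * i for i in range(2000 * batch_size))


if __name__ == "__main__":
    latency = measure_latency_ms(lambda: toy_model(1))
    throughput = measure_throughput(lambda: toy_model(8), batch_size=8)
    print(f"latency: {latency:.3f} ms, throughput: {throughput:.1f} images/s")
```

When timing a real model on a GPU, the device has to be synchronized before the timer is read, because kernel launches are asynchronous; that step is omitted here since the stand-in workload runs on the CPU.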

Paper Structure (29 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Comparison of GPU throughput and top-1 accuracy of recent image classification architectures and LowFormer. The markers refer to different complexity classes of corresponding architecture families. LowFormer consistently achieves a higher throughput than models with similar accuracy.
  • Figure 2: The left figures depict average execution time on GPU of the fused mobile inverted bottleneck (MBConv) relative to the unfused one, while the right figures depict the relative amount of MACs. The blue areas in the left figures correspond to configurations (number of channels and resolution) where fused MBConv is faster, while red corresponds to the opposite. For the right figures the red areas correspond to configurations where the fused MBConv has a higher amount of MACs. Bold and italic numbers refer to entries with a particularly unequal ratio. Even though the fused MBConv always has more MACs, it is faster for many configurations.
  • Figure 3: Architecture of LowFormer. The resolutions refer to a 224×224 input. The LowFormer block is shown in Figure 4. MBConv denotes the mobile inverted bottleneck block, Conv a convolution, and Cls head the image classification head. The values of $C_0 - C_4$ and $L_0 - L_4$ are specified in the paper's architecture table (tab:archnumbers).
  • Figure 4: LowFormer block design. DWConv, PWConv, LN, MLP and SDA denote depthwise convolution, pointwise convolution, layer normalization, multi-layer perceptron and Scaled Dot-Product Attention, respectively. In contrast to the traditional MHSA, we encapsulate the SDA with two depthwise convolutions (the second is a transposed depthwise convolution). The projections for MHSA are realized with pointwise convolutions. $DW\downarrow_n$ means that the resolution is downscaled by the factor $n$, and $DW\uparrow_n$ that it is upscaled by $n$.
  • Figure 5: Structure of the fused and unfused MBConv block. $C$ refers to the channel dimension. In Conv and DWConv the "↓n" refers to a potential stride. Both have an expansion factor of 4 in this figure.
  • ...and 1 more figure
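
The MAC comparison described in the captions of Figures 2 and 5 can be reproduced with simple arithmetic. The sketch below rests on assumptions not stated in the captions: 3x3 kernels for the spatial convolutions, equal input and output channels $C$, an input resolution divisible by the stride, and only convolution MACs counted (normalization, activations and biases ignored). The function names are hypothetical.

```python
def conv_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """MACs of a 2D convolution that produces an h_out x w_out output map."""
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


def unfused_mbconv_macs(h, w, c, expansion=4, stride=1):
    """PWConv (expand) -> 3x3 DWConv (optional stride) -> PWConv (project)."""
    hidden = c * expansion
    h_out, w_out = h // stride, w // stride
    expand = conv_macs(h, w, c, hidden, kernel=1)
    depthwise = conv_macs(h_out, w_out, hidden, hidden, kernel=3, groups=hidden)
    project = conv_macs(h_out, w_out, hidden, c, kernel=1)
    return expand + depthwise + project


def fused_mbconv_macs(h, w, c, expansion=4, stride=1):
    """3x3 Conv (expand, optional stride) -> PWConv (project)."""
    hidden = c * expansion
    h_out, w_out = h // stride, w // stride
    fused = conv_macs(h_out, w_out, c, hidden, kernel=3)
    project = conv_macs(h_out, w_out, hidden, c, kernel=1)
    return fused + project


if __name__ == "__main__":
    for channels in (16, 32, 64, 128):
        unfused = unfused_mbconv_macs(56, 56, channels)
        fused = fused_mbconv_macs(56, 56, channels)
        print(channels, unfused, fused, round(fused / unfused, 2))
```

Under these assumptions and with stride 1, the ratio of fused to unfused MACs reduces to $10C / (2C + 9)$, which exceeds 1 for every $C \geq 2$. This is consistent with the caption's statement that the fused MBConv always has more MACs, while saying nothing about which variant runs faster on a given device.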