On the Efficiency of Convolutional Neural Networks

Andrew Lavin

TL;DR

This work addresses the discrepancy between low arithmetic cost and high latency in modern convnets by introducing a co-optimization framework that unifies model efficiency with computational efficiency. It extends roofline ideas into a waterline model for sequences of kernels and introduces block-fusion kernels and the tensor machine to achieve higher operational intensity without increasing DRAM transfers. The ConvFirst and MBConv block designs, along with ConvFirstNet, demonstrate substantial latency reductions and improved efficiency compared with baselines like ConvNeXt and EfficientNet, illustrating the practical impact of fused kernels and depth-first strategies. The approach provides a practical pathway toward faster, more accurate convnets and offers analytic tools to guide future hardware-aware model design.

Abstract

Since the breakthrough performance of AlexNet in 2012, convolutional neural networks (convnets) have grown into extremely powerful vision models. Deep learning researchers have used convnets to perform vision tasks with accuracy that was unachievable a decade ago. Confronted with the immense computation that convnets use, deep learning researchers also became interested in efficiency. However, the engineers who deployed efficient convnets soon realized that they were slower than the previous generation, despite using fewer operations. Many reverted to older models that ran faster. Hence researchers switched the objective of their search from arithmetic complexity to latency and produced a new wave of models that performed better. Paradoxically, these models also used more operations. Skepticism grew among researchers and engineers alike about the relevance of arithmetic complexity. Contrary to the prevailing view that latency and arithmetic complexity are irreconcilable, a simple formula relates both through computational efficiency. This insight enabled us to co-optimize the separate factors that determine latency. We observed that the degenerate conv2d layers that produce the best accuracy–complexity trade-off also use significant memory resources and have low computational efficiency. We devised block fusion algorithms to implement all the layers of a residual block in a single kernel, thereby creating temporal locality, avoiding communication, and reducing workspace size. Our ConvFirst model with block-fusion kernels has less arithmetic complexity and greater computational efficiency than baseline models and kernels, and ran approximately four times as fast as ConvNeXt. We also created novel tools, including efficiency gap plots and waterline analysis. Our unified approach to convnet efficiency envisions a new era of models and kernels that achieve greater accuracy at lower cost.
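
The relation between arithmetic complexity, computational efficiency, and latency mentioned in the abstract can be illustrated with a minimal sketch. It assumes that ideal latency is the operation count divided by peak throughput and that computational efficiency is ideal latency divided by actual latency, as the figure captions describe. The function names, the two-FLOPs-per-MAC convention, and the workload numbers are assumptions made for illustration and are not taken from the paper.

```python
import math

FLOPS_PER_MAC = 2.0  # assumption: one multiply-accumulate counts as two FLOPs


def ideal_latency(macs: float, peak_flops: float) -> float:
    """Lowest possible latency in seconds for the given number of MACs."""
    return macs * FLOPS_PER_MAC / peak_flops


def computational_efficiency(ideal_s: float, actual_s: float) -> float:
    """Ratio of ideal latency to actual latency (1.0 means peak throughput)."""
    return ideal_s / actual_s


def predicted_latency(macs: float, peak_flops: float, efficiency: float) -> float:
    """Latency implied by arithmetic complexity and computational efficiency."""
    return ideal_latency(macs, peak_flops) / efficiency


def efficiency_gap_log10(ideal_s: float, actual_s: float) -> float:
    """Width of the gap between actual and ideal latency on a log10 axis."""
    return math.log10(actual_s) - math.log10(ideal_s)


if __name__ == "__main__":
    peak = 76.7e12      # FLOP/s, the peak throughput quoted in the Figure A1 caption
    macs = 128 * 4.0e9  # hypothetical workload: 128 images at 4 GMACs each
    actual = 0.050      # hypothetical measured latency in seconds

    ideal = ideal_latency(macs, peak)
    eff = computational_efficiency(ideal, actual)
    print(f"ideal latency: {ideal * 1e3:.2f} ms")
    print(f"computational efficiency: {eff:.1%}")
    print(f"log10 gap width: {efficiency_gap_log10(ideal, actual):.3f}")
```

Under these definitions, the gap width on a logarithmic latency axis equals the negative logarithm of computational efficiency, so fewer operations and higher efficiency each lower latency independently.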

Paper Structure (34 sections, 38 equations, 18 figures, 5 tables)

This paper contains 34 sections, 38 equations, 18 figures, 5 tables.

Figures (18)

  • Figure : Efficiency gap plots compare ideal and actual latency and quantify the difference as computational efficiency. (a). Dividing arithmetic complexity (MACs) by peak arithmetic throughput yields the ideal latency. (b). Inference time on the GPU is the actual latency. (c). The ideal latency of ConvNeXt is longer than the actual latency of ConvFirst. (d). Low computational efficiency causes a wide gap between ideal and actual latency. EfficientNet's computational efficiency ranges from 5% to 8%, which produces a wide gap. Our ConvFirst model with block-fusion kernels ranges from 47% to 55% and has a narrow gap. We used an NVIDIA A5000 GPU and batch size $128$.
  • Figure A1: Efficiency gap plots illustrate the difference between ideal and actual performance for different models and software. They help us understand the separate contributions of model efficiency and computational efficiency. This figure shows ImageNet-1K classification accuracy and latency for EfficientNet and ConvNeXt using an NVIDIA Ampere A5000 GPU with $76.7$ TFLOP/s peak arithmetic throughput running PyTorch Inductor software. Batch size equals $128$. (a). Model efficiency measures accuracy as a function of the number of multiply-accumulate operations (MACs) performed. Dividing MACs by peak arithmetic throughput of the processor yields ideal latency, the lowest possible latency for the number of operations. (b). Actual efficiency measures accuracy versus actual latency for a combination of model and software. (c). Overlaying model and actual efficiency graphs reveals the efficiency gap, the offset between ideal and actual latency. (d). Measured with a logarithmic scale on the latency axis, the (negative) width of the efficiency gap equals the logarithm of computational efficiency, the ratio between actual and ideal performance. EfficientNet's poor computational efficiency creates a wide efficiency gap, resulting in longer latency than ConvNeXt, despite superior model efficiency.
  • Figure A2: Evolution of convnet blocks and convolutional layers. (a). ResNet34 used residual blocks with layers: Conv($3 \times 3$). (b). ResNet50 added bottleneck blocks with an expansion ratio equal to four using point-wise convolutions: Conv4($1 \times 1$). (c). ResNeXT101_32x4d used grouped convolutions with group-width equal to 32: Conv32($3 \times 3$). (d). MobileNetV2 used inverted residual blocks with depth-wise convolutions: Conv1($3 \times 3$). (e). Operational intensity of convnet layers as a function of the number of channels. The Conv($3 \times 3$) layers used by early convnets had large operational intensity. Point-wise, grouped, and depth-wise convolutions progressively decreased the operational intensity of layers.
  • Figure A3: Waterline analysis of baseline models. These plots compare the operational intensity of individual layers with the op:byte waterline of the NVIDIA Ampere A5000 GPU. Single-layer kernels are often memory bound because their low operational intensity is "underwater." Memory-bound kernels have higher attainable latency and lower attainable computational efficiency (max efficiency). Again, we used batch size equal to 128.
  • Figure A4: Comparison of the attainable computational efficiency calculated by the waterline and roofline performance models. Roofline was originally intended as a performance model for a single parallel kernel (Williams et al., 2009). Hence roofline overestimates the attainable efficiency for a sequence of kernels if some of the kernels are compute bound and others are memory bound. Waterline is accurate regardless, because it measures how each kernel contributes to the minimum latency of the sequence. ConvFirst is our new model. See the ConvFirstNet section for details.
  • ...and 13 more figures
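
The difference between the waterline and roofline estimates described in the captions for Figures A3 and A4 can be sketched as follows. This is one reading of those captions, not the paper's exact formulation: it assumes each kernel's minimum latency is the larger of its compute time and its DRAM transfer time, that the waterline model sums these per-kernel minimums, and that a naive roofline estimate treats the whole sequence as a single kernel. The `Kernel` class, the function names, and the toy machine parameters are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Kernel:
    ops: float         # arithmetic operations performed by the kernel
    dram_bytes: float  # bytes transferred to and from DRAM


def waterline(peak_ops: float, bandwidth: float) -> float:
    """Machine op:byte ratio; kernels with lower intensity are memory bound."""
    return peak_ops / bandwidth


def is_memory_bound(k: Kernel, peak_ops: float, bandwidth: float) -> bool:
    """True when the kernel's operational intensity is below the waterline."""
    return k.ops / k.dram_bytes < waterline(peak_ops, bandwidth)


def min_latency(k: Kernel, peak_ops: float, bandwidth: float) -> float:
    """Minimum latency of one kernel: the slower of compute and memory."""
    return max(k.ops / peak_ops, k.dram_bytes / bandwidth)


def waterline_efficiency(ks: Sequence[Kernel], peak_ops: float, bandwidth: float) -> float:
    """Attainable efficiency when each kernel is bounded separately."""
    ideal = sum(k.ops for k in ks) / peak_ops
    attainable = sum(min_latency(k, peak_ops, bandwidth) for k in ks)
    return ideal / attainable


def roofline_efficiency(ks: Sequence[Kernel], peak_ops: float, bandwidth: float) -> float:
    """Attainable efficiency when the sequence is treated as one big kernel."""
    total = Kernel(sum(k.ops for k in ks), sum(k.dram_bytes for k in ks))
    return (total.ops / peak_ops) / min_latency(total, peak_ops, bandwidth)


if __name__ == "__main__":
    peak, bw = 100.0, 10.0  # toy machine: 100 ops/s and 10 bytes/s, waterline of 10
    sequence = [
        Kernel(ops=1000.0, dram_bytes=10.0),  # intensity 100: compute bound
        Kernel(ops=100.0, dram_bytes=100.0),  # intensity 1: memory bound
    ]
    for k in sequence:
        print(k, "memory bound" if is_memory_bound(k, peak, bw) else "compute bound")
    print("waterline efficiency:", waterline_efficiency(sequence, peak, bw))
    print("roofline efficiency:", roofline_efficiency(sequence, peak, bw))
```

In this toy sequence the aggregate operational intensity sits exactly on the waterline, so the single-kernel roofline estimate reports full efficiency, while the per-kernel sum shows that the memory-bound kernel adds latency the compute-bound kernel cannot hide.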