Benchmarking Deep Learning Convolutions on Energy-constrained CPUs

Enrique Galvez, Adrien Cassagne, Alix Munier, Manuel Bouyer

TL;DR

This work benchmarks CNN convolutions on energy-constrained CPUs across ARM, Intel, AMD and Nvidia platforms, focusing on direct, GEMM-based lowering, and Winograd convolutions for small batches and typical kernels. A novel high-resolution socket-level power measurement system, named dalek, enables fine-grained energy profiling of CNN inference. Key findings show that Winograd and GEMM approaches are consistently energy-efficient across architectures, while full-inference benefits from GEMM data management; Nvidia Jetson AGX Orin offers the best latency–power balance. The results provide practical guidance for energy-aware embedded deployment and establish a robust methodology for cross-platform CPU convolution benchmarking.

Abstract

This work evaluates state-of-the-art convolution algorithms for CPU-based deep learning inference. While most prior studies focus on GPUs or NPUs, CPU implementations remain relatively underoptimized. We benchmark direct, GEMM-based, and Winograd convolutions across modern CPUs from ARM, Intel, AMD, Apple, and Nvidia, considering both latency and energy efficiency. Our results highlight the key architectural factors that govern CPU efficiency for convolution operations, providing practical guidance for energy-aware embedded deployment. As the main result of this work, the Nvidia AGX Orin combined with the GEMM algorithm achieves the best trade-off between inference latency and energy consumption.
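To make the difference between two of the benchmarked algorithm families concrete, here is a minimal pure-Python sketch of a direct convolution and of a GEMM-based convolution that uses im2col lowering. It is only an illustration under stated assumptions (batch size 1, stride 1, zero padding, nested lists instead of tuned kernels); the function names are hypothetical and this is not the implementation evaluated in the paper.

```python
def conv2d_direct(x, w, pad):
    """Direct convolution. x: [IC][IH][IW], w: [OC][IC][KH][KW], stride 1."""
    ic, ih, iw = len(x), len(x[0]), len(x[0][0])
    oc, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    oh, ow = ih + 2 * pad - kh + 1, iw + 2 * pad - kw + 1
    y = [[[0.0] * ow for _ in range(oh)] for _ in range(oc)]
    for o in range(oc):
        for i in range(oh):
            for j in range(ow):
                acc = 0.0
                for c in range(ic):
                    for p in range(kh):
                        for q in range(kw):
                            r, s = i + p - pad, j + q - pad
                            # Positions outside the input act as zero padding.
                            if 0 <= r < ih and 0 <= s < iw:
                                acc += x[c][r][s] * w[o][c][p][q]
                y[o][i][j] = acc
    return y


def im2col(x, kh, kw, pad):
    """Lower x into a matrix with IC*KH*KW rows and OH*OW columns."""
    ic, ih, iw = len(x), len(x[0]), len(x[0][0])
    oh, ow = ih + 2 * pad - kh + 1, iw + 2 * pad - kw + 1
    cols = []
    for c in range(ic):
        for p in range(kh):
            for q in range(kw):
                row = []
                for i in range(oh):
                    for j in range(ow):
                        r, s = i + p - pad, j + q - pad
                        inside = 0 <= r < ih and 0 <= s < iw
                        row.append(x[c][r][s] if inside else 0.0)
                cols.append(row)
    return cols, oh, ow


def conv2d_gemm(x, w, pad):
    """Convolution as one matrix product: W[OC, K] @ cols[K, OH*OW], K = IC*KH*KW."""
    kh, kw = len(w[0][0]), len(w[0][0][0])
    cols, oh, ow = im2col(x, kh, kw, pad)
    # Flatten each filter in the same (channel, kernel row, kernel col) order as im2col.
    w_mat = [[v for ch in f for krow in ch for v in krow] for f in w]
    out = []
    for w_row in w_mat:
        flat = [sum(a * col_row[n] for a, col_row in zip(w_row, cols))
                for n in range(oh * ow)]
        out.append([flat[i * ow:(i + 1) * ow] for i in range(oh)])
    return out
```

Both functions compute the same output. The GEMM variant trades extra memory (the lowered matrix repeats each input value up to KH*KW times) for a single large matrix product, which is the kind of data-management trade-off that matters when comparing the two approaches on a CPU.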

Paper Structure

This paper contains 5 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Energy consumption measurement: MSRs vs. socket-level measurement.
  • Figure 2: Convolution energy consumption (MB1_IC64IH56_OC64OH56_KH3PH1).
  • Figure 3: Latency and instantaneous power depending on architecture, multithreading and algorithm for a full inference of ResNet50v1.5.
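
The layer label in the Figure 2 caption can be decoded programmatically. The sketch below assumes the fields follow a common convolution-benchmark naming convention (MB = mini-batch, IC/OC = input/output channels, IH/OH = input/output height, KH = kernel height, PH = padding height), and that the layer is square with stride 1; the source does not define these fields, so this reading and the helper names are assumptions.

```python
import re


def parse_conv_label(label):
    """Split a label such as 'MB1_IC64IH56' into {'MB': 1, 'IC': 64, 'IH': 56}."""
    return {name: int(value) for name, value in re.findall(r"([A-Za-z]+)(\d+)", label)}


def conv_macs(d):
    """Multiply-accumulate count, assuming OW == OH and KW == KH (square layer)."""
    return d["MB"] * d["OC"] * d["OH"] * d["OH"] * d["IC"] * d["KH"] * d["KH"]


desc = parse_conv_label("MB1_IC64IH56_OC64OH56_KH3PH1")
# Sanity check under the stride-1 assumption: OH = IH + 2*PH - KH + 1.
assert desc["OH"] == desc["IH"] + 2 * desc["PH"] - desc["KH"] + 1
print(desc)
print(conv_macs(desc))  # 115605504 multiply-accumulates under these assumptions
```

Under this reading, the label describes a 3x3 convolution with 64 input and 64 output channels on a 56x56 feature map with batch size 1, a layer shape that appears in ResNet-style networks.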