Benchmarking Ultra-Low-Power $μ$NPUs

Josh Millar, Yushan Huang, Sarab Sethi, Hamed Haddadi, Anil Madhavapeddy

TL;DR

This work addresses the challenge of reliably evaluating microcontroller-scale neural processing units (μNPUs) for on-device inference. It introduces an open-source model compilation pipeline and runs independent, side-by-side benchmarks of multiple commercially available μNPU platforms under a unified workload suite. Key findings show that memory I/O and initialization overhead often dominate end-to-end latency and energy. Some platforms are orders of magnitude more energy-efficient than general-purpose MCUs, while others excel in latency or model capacity. The results provide actionable guidance for hardware and software developers and establish a foundation for standardized, ongoing evaluation in this rapidly evolving space.

Abstract

Efficient on-device neural network (NN) inference offers predictable latency, improved privacy and reliability, and lower operating costs for vendors than cloud-based inference. This has sparked recent development of microcontroller-scale NN accelerators, also known as neural processing units ($μ$NPUs), designed specifically for ultra-low-power applications. We present the first comparative evaluation of several commercially available $μ$NPUs, including the first independent benchmarks for multiple platforms. To ensure fairness, we develop and open-source a model compilation pipeline supporting consistent benchmarking of quantized models across diverse microcontroller hardware. Our resulting analysis uncovers both expected performance trends and surprising disparities between hardware specifications and actual performance, including certain $μ$NPUs exhibiting unexpected scaling behaviors with model complexity. This work provides a foundation for ongoing evaluation of $μ$NPU platforms and offers practical insights for both hardware and software developers in this rapidly evolving space.

Paper Structure

This paper contains 29 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: typical $\mu$NPU hardware architecture
  • Figure 2: the various $\mu$NPUs used in our benchmark, and how they compare in terms of max GOPS, peak power draw, and theoretical efficiency (GOPS/mW).
  • Figure 3: an overview of our model compilation workflow.
  • Figure 4: power trace of YoloV1 inference on HX-WE2.
  • Figure 5: latency for each stage, model, and platform.
  • ...and 2 more figures
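
The caption of Figure 2 refers to theoretical efficiency in GOPS/mW. As a minimal sketch, assuming this is simply peak throughput divided by peak power draw, the metric can be computed as below. The function name, platform names, and numbers are hypothetical placeholders for illustration and are not values from the paper.

```python
def theoretical_efficiency(max_gops: float, peak_power_mw: float) -> float:
    """Return theoretical efficiency in GOPS/mW (peak throughput / peak power)."""
    if peak_power_mw <= 0:
        raise ValueError("peak power must be positive")
    return max_gops / peak_power_mw


# Hypothetical placeholder specs: (max GOPS, peak power in mW).
# These are NOT measurements or datasheet values from the paper.
example_platforms = {
    "platform_a": (30.0, 60.0),
    "platform_b": (100.0, 400.0),
}

# Rank the placeholder platforms from most to least efficient.
ranking = sorted(
    ((name, theoretical_efficiency(gops, mw))
     for name, (gops, mw) in example_platforms.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, eff in ranking:
    print(f"{name}: {eff:.3f} GOPS/mW")
```

Because this figure is derived from datasheet peaks, it says nothing about memory I/O or initialization overhead, which the TL;DR identifies as frequent contributors to end-to-end latency and energy.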