HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms

Josse Van Delm, Maarten Vandersteegen, Alessio Burrello, Giuseppe Maria Sarda, Francesco Conti, Daniele Jahier Pagliari, Luca Benini, Marian Verhelst

TL;DR

HTVM tackles deploying DNNs on heterogeneous TinyML SoCs with limited memory by fusing TVM's flexible codegen with DORY's memory-aware tiling in an accelerator-aware, ahead-of-time flow. It uses accelerator-aware pattern matching to dispatch work to digital and analog accelerators on DIANA and relies on a BYOC backend to generate optimized accelerator code while managing data movement. The approach yields large end-to-end speedups, substantial binary-size reductions, and near-peak accelerator performance, demonstrated through MLPerf Tiny benchmarks on a real heterogeneous platform. This work provides an open-source, scalable path for deploying diverse neural networks on mixed-architecture edge devices without online autotuning.

Abstract

Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM - a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf(TM) Tiny suite on DIANA, an SoC with a RISC-V CPU, and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment.
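The memory-aware tiling mentioned above can be illustrated with a small sketch. It is not the solver used by DORY or HTVM: it is a brute-force search under stated assumptions (int8 tensors, a square kernel with stride 1 and no padding, all weights resident in L1, no double buffering). The names `ConvLayer`, `tile_bytes`, and `pick_tile` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of a Conv2D layer (int8 data assumed)."""
    in_ch: int
    out_ch: int
    height: int      # output height
    width: int       # output width
    kernel: int = 3  # square kernel, stride 1, no padding (assumption)


def tile_bytes(layer: ConvLayer, tile_h: int, tile_w: int) -> int:
    """Bytes of L1 needed to hold one output tile, its input window, and all weights."""
    in_h = tile_h + layer.kernel - 1
    in_w = tile_w + layer.kernel - 1
    input_bytes = layer.in_ch * in_h * in_w
    weight_bytes = layer.out_ch * layer.in_ch * layer.kernel ** 2
    output_bytes = layer.out_ch * tile_h * tile_w
    return input_bytes + weight_bytes + output_bytes


def pick_tile(layer: ConvLayer, l1_budget: int):
    """Return the largest-area (tile_h, tile_w) that fits in l1_budget, or None."""
    best = None
    for tile_h in range(1, layer.height + 1):
        for tile_w in range(1, layer.width + 1):
            if tile_bytes(layer, tile_h, tile_w) > l1_budget:
                continue
            if best is None or tile_h * tile_w > best[0] * best[1]:
                best = (tile_h, tile_w)
    return best


if __name__ == "__main__":
    layer = ConvLayer(in_ch=16, out_ch=16, height=32, width=32)
    for budget_kib in (64, 16, 4):
        print(budget_kib, "KiB ->", pick_tile(layer, budget_kib * 1024))
```

A smaller budget forces smaller tiles, which means more transfers between memory levels per layer. An accelerator-aware tiler would additionally prefer tile shapes that match the accelerator's native dimensions, which this sketch does not model.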

Paper Structure

This paper contains 14 sections, 2 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: HTVM compilation flow.
  • Figure 2: Time diagram of a neural network deployed with HTVM.
  • Figure 3: Measurement setup and DIANA architecture, from Ueyoshi et al. (2022).
  • Figure 4: Effect on latency of tiling with accelerator-aware heuristics as the L1 memory budget decreases, for different layers executed on DIANA's digital accelerator.
  • Figure 5: Single-layer overhead characterization on the digital and analog accelerators for Conv2D, FC, and DWConv2D layer types, evaluated for different geometries. For the analog layers, channel scaling and spatial scaling are explored separately. For the digital layers, spatial scaling is explored with Conv2D layers and channel scaling with FC layers.