
Performance Analysis of Matrix Multiplication for Deep Learning on the Edge

Cristian Ramírez, Adrián Castelló, Héctor Martínez, Enrique S. Quintana-Ortí

TL;DR

The paper addresses the challenge of optimizing the GEMM kernel for deep learning at the edge across heterogeneous IoT processors. It introduces a GotoBLAS2/BLIS-inspired performance simulator that models data transfers across the memory hierarchy to evaluate blocked GEMM variants on a GAP8-style edge device. Calibration experiments yield transfer and arithmetic rates, enabling runtime predictions with relative errors under $2\%$ for the tested cases. Comparative analysis shows how micro-kernel shape and block placement influence performance, with no single variant being best across all layers, which underscores the heterogeneity of edge workloads. The tool provides guidance for architecture-aware GEMM optimization and informs future extensions to cache-based models and multi-core edge platforms.

Abstract

The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator adheres to the modern implementations of GEMM, advocated by GotoBLAS2, BLIS, OpenBLAS, etc., to carefully account for the amount of data transfers across the memory hierarchy of different algorithmic variants of the kernel. A small collection of experiments provides the necessary data to calibrate the simulator and deliver highly accurate estimations of the execution time for a given processor architecture.
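
The idea of a calibrated simulator can be illustrated with a minimal sketch. It is not the authors' simulator: it assumes the baseline GotoBLAS2/BLIS loop order (pack a block of $B$, then a block of $A$, then update a block of $C$), and the function name `estimate_gemm_time`, the rate keys `pack_b`, `pack_a`, `stream_c`, `arith`, and the one-byte default element size are all hypothetical. The estimate is the volume of data moved divided by a calibrated transfer rate, plus the flop count divided by a calibrated arithmetic rate.

```python
def estimate_gemm_time(m, n, k, mc, nc, kc, rates, elem_bytes=1):
    """Estimate the time of a blocked GEMM C += A * B (hypothetical cost model).

    rates: bytes/second for "pack_b", "pack_a", "stream_c" and
           flops/second for "arith"; in practice these would come from
           calibration experiments on the target processor.
    """
    t_pack_b = t_pack_a = t_stream_c = 0.0
    for jc in range(0, n, nc):            # loop over column blocks of B and C
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):        # loop over the shared dimension
            kb = min(kc, k - pc)
            # Pack a kb x nb block of B into the buffer B_c.
            t_pack_b += kb * nb * elem_bytes / rates["pack_b"]
            for ic in range(0, m, mc):    # loop over row blocks of A and C
                mb = min(mc, m - ic)
                # Pack an mb x kb block of A into the buffer A_c.
                t_pack_a += mb * kb * elem_bytes / rates["pack_a"]
                # Assumption: the mb x nb block of C is read and written once.
                t_stream_c += 2 * mb * nb * elem_bytes / rates["stream_c"]
    # 2*m*n*k flops in total (one multiply and one add per inner-product term).
    t_arith = 2.0 * m * n * k / rates["arith"]
    return {
        "pack_b": t_pack_b,
        "pack_a": t_pack_a,
        "stream_c": t_stream_c,
        "arith": t_arith,
        "total": t_pack_b + t_pack_a + t_stream_c + t_arith,
    }
```

Breaking the estimate down by component mirrors the kind of per-component cost distribution reported in the paper's figures; changing the loop order or the level at which each operand resides would change the transfer counts, which is what distinguishes the algorithmic variants.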

Paper Structure (9 sections, 6 figures, 2 tables)

This paper contains 9 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The baseline algorithm of GEMM. Here $C_c$ is a notation artifact, introduced to ease the presentation of the algorithm, while $A_c$ and $B_c$ are actual buffers that maintain copies of certain blocks of $A$ and $B$.
  • Figure 2: Packing in the baseline algorithm of GEMM. Note how the entries of $A$ and $B$ are re-organized into $A_c$ and $B_c$ in micro-panels of $m_r$ rows and $n_r$ columns, respectively.
  • Figure 3: Variants of the family of algorithms for GEMM with $A$ resident in the processor registers: C3B2A0 (top) and B3C2A0 (bottom).
  • Figure 4: Distribution of costs among the different components of the B3C2A0 algorithm using micro-kernels of dimension $4\times 4$, $4\times 8$, and $4\times 12$. The labels starting with "E" and "T" below each bar distinguish between results from experimentation and the simulator, respectively.
  • Figure 5: Execution time of the three algorithms for the GEMM in layer #10 of MobileNetV1, estimated using the performance simulator calibrated for the GAP8.
  • ...and 1 more figure
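
The packing mentioned in the caption of Figure 2 can be sketched as follows. This is an illustration under the assumption of the conventional BLIS-style layout (each micro-panel of $A_c$ holds $m_r$ rows stored column by column, and each micro-panel of $B_c$ holds $n_r$ columns stored row by row); the helper names `pack_a` and `pack_b` are hypothetical, and edge micro-panels are simply left shorter rather than zero-padded.

```python
def pack_a(a_block, mr):
    """Pack an mc x kc block of A (list of rows) into micro-panels of mr rows.

    Within a micro-panel the entries are stored column by column, so the
    mr values needed for one rank-1 update are contiguous.
    """
    mc, kc = len(a_block), len(a_block[0])
    packed = []
    for i0 in range(0, mc, mr):           # one micro-panel per group of mr rows
        rows = a_block[i0:i0 + mr]
        for p in range(kc):               # walk the columns of the micro-panel
            for row in rows:
                packed.append(row[p])
    return packed


def pack_b(b_block, nr):
    """Pack a kc x nc block of B (list of rows) into micro-panels of nr columns.

    Within a micro-panel the entries are stored row by row, so the
    nr values needed for one rank-1 update are contiguous.
    """
    kc, nc = len(b_block), len(b_block[0])
    packed = []
    for j0 in range(0, nc, nr):           # one micro-panel per group of nr columns
        for p in range(kc):               # walk the rows of the micro-panel
            packed.extend(b_block[p][j0:j0 + nr])
    return packed
```

For example, `pack_a([[1, 2], [3, 4], [5, 6], [7, 8]], 2)` yields `[1, 3, 2, 4, 5, 7, 6, 8]`: the first micro-panel (rows 0 and 1) is emitted column by column, followed by the second.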