Balanced segmentation of CNNs for multi-TPU inference

Jorge Villarrubia, Luis Costero, Francisco D. Igual, Katzalin Olcoz

TL;DR

This work tackles the memory bottlenecks and workload imbalance that arise when performing CNN inference on edge devices equipped with multiple Edge TPUs. It first characterizes single-TPU performance and then analyzes segmentation strategies, introducing a balanced segmentation pipeline (Segm_Balanced) that combines depth-aware partitioning with workload balancing and a refinement step to minimize host memory usage. Evaluated against the vendor's compiler-based Segm_Comp and the profiling-based Segm_Prof, Segm_Balanced delivers speedups of up to $2.60\times$ over Segm_Comp and can achieve super-linear speedups over a single TPU. These results support multi-TPU inference as a practical option for edge CNN workloads constrained by limited on-chip memory.

Abstract

In this paper, we propose different alternatives for convolutional neural network (CNN) segmentation, targeting inference on computing architectures composed of multiple Edge TPUs. Specifically, we compare the inference performance of a number of state-of-the-art CNN models, taking as references the inference time on a single TPU and the compiler-based pipelined inference implementation provided by Google's Edge TPU compiler. Starting from a profiling-based segmentation strategy, we provide further refinements to balance the workload across multiple TPUs, leveraging their cooperative computing power, reducing work imbalance, and alleviating the memory access bottleneck caused by the limited amount of on-chip memory per TPU. Compared with a single TPU, the observed results show superlinear speedups; compared with the segmentation offered by the compiler for multiple TPUs, they show accelerations of up to 2.60x.
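
To make the idea of workload balancing concrete, the following is a minimal sketch, not the authors' Segm_Balanced algorithm. It assumes each layer has a single static integer cost (for example, parameter bytes), that segments must be contiguous runs of layers, and that the goal is to minimize the heaviest segment, since the slowest stage bounds pipeline throughput. The names `segments_needed` and `balanced_segments` are hypothetical, and the profiling and refinement steps mentioned in the abstract are not modeled.

```python
from typing import List


def segments_needed(costs: List[int], cap: int) -> int:
    """Count the contiguous segments a greedy packing uses when no segment may exceed cap."""
    count, load = 1, 0
    for c in costs:
        if load + c > cap:
            count += 1  # start a new segment with this layer
            load = c
        else:
            load += c
    return count


def balanced_segments(costs: List[int], k: int) -> List[List[int]]:
    """Split per-layer costs into at most k contiguous segments, minimizing the heaviest one.

    Assumption: costs are non-negative integers and k >= 1.
    """
    # Binary search for the smallest cap that fits in k segments.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if segments_needed(costs, mid) <= k:
            hi = mid
        else:
            lo = mid + 1

    # Rebuild the segments greedily under the optimal cap.
    segments: List[List[int]] = []
    current: List[int] = []
    load = 0
    for c in costs:
        if current and load + c > lo:
            segments.append(current)
            current, load = [], 0
        current.append(c)
        load += c
    segments.append(current)
    return segments


if __name__ == "__main__":
    layer_costs = [4, 2, 7, 1, 3, 5]  # hypothetical per-layer costs
    for seg in balanced_segments(layer_costs, 3):
        print(seg, sum(seg))
```

A static cost proxy like this ignores effects such as on-chip memory limits and host transfers, which is why a profiling-based starting point and further refinement, as described in the abstract, would be needed in practice.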

Paper Structure

This paper contains 23 sections, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Example of a 3x3 systolic array and the cycle-by-cycle data flow through the chains.
  • Figure 2: Average performance of inference for synthetic and real models (in TOPS) after 50 repetitions using batch size 1, as a function of the model size.
  • Figure 4: Performance of the synthetic models (blue curve associated with the left vertical axis) and their device and host memory usage (yellow and red curves associated with the right vertical axis).
  • Figure 5: Top: Single TPU execution of a model with layers stored on the host. Bottom: Pipelined execution of the model, segmented into 3 TPUs, without layers stored in host memory.
  • Figure 6: Speedup of synthetic models using Segm_Comp, segmented into $2$, $3$ and $4$ TPUs run on a $15$-input batch, versus execution on a single TPU.
  • ...and 4 more figures