Efficient Deployment of CNN Models on Multiple In-Memory Computing Units

Eleni Bougioukou, Theodore Antonakopoulos

TL;DR

The paper addresses efficient CNN inference on hybrid in-memory computing platforms by tackling the node-to-PU mapping problem. It introduces the Load-Balance-Longest-Path (LBLP) algorithm and evaluates it on the In-Memory Computing Emulator (IMCE) against baseline allocation strategies, reporting a higher processing rate and lower latency on ResNet8, ResNet18, and YOLOv8n. By balancing load across PUs and preserving parallelism between branches, LBLP achieves substantial throughput gains (e.g., more than 2x for ResNet18) while maintaining good PU utilization. The work highlights the importance of task mapping in IMC/DPU systems for energy-efficient edge-to-cloud AI deployment and outlines future experiments with hardware integration. It contributes a scalable, low-complexity scheduling approach that can guide the design and optimization of IMC-enabled inference engines.

Abstract

In-Memory Computing (IMC) represents a paradigm shift in deep learning acceleration, mitigating data movement bottlenecks and leveraging the inherent parallelism of memory-based computations. Deploying Convolutional Neural Networks (CNNs) efficiently on IMC-based hardware requires advanced task allocation strategies to achieve maximum computational efficiency. In this work, we use an IMC Emulator (IMCE) with multiple Processing Units (PUs) to investigate how the deployment of a CNN model in a multi-processing system affects its performance in terms of processing rate and latency. For that purpose, we introduce the Load-Balance-Longest-Path (LBLP) algorithm, which dynamically assigns all CNN nodes to the available IMCE PUs, maximizing the processing rate and minimizing latency through efficient resource utilization. We benchmark LBLP against alternative scheduling strategies for a number of CNN models, and the experimental results demonstrate the effectiveness of the proposed algorithm.
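
To make the node-to-PU mapping problem concrete, the following is a minimal illustrative sketch and not the paper's LBLP algorithm, whose steps are not given in this abstract. It assumes a generic heuristic: rank each node by the cost of the longest path from it to the output, then assign nodes in rank order to the currently least-loaded PU. The function names, the toy graph, and the per-node costs are all hypothetical.

```python
from functools import lru_cache


def longest_path_ranks(costs, succs):
    """Return, per node, the cost of the longest path from that node to a sink.

    costs: dict node -> execution cost (hypothetical units)
    succs: dict node -> list of successor nodes in the CNN graph (a DAG)
    """
    @lru_cache(maxsize=None)
    def rank(node):
        # A node's rank is its own cost plus the largest rank among its successors.
        return costs[node] + max((rank(s) for s in succs.get(node, ())), default=0)

    return {node: rank(node) for node in costs}


def assign_nodes(costs, succs, num_pus):
    """Greedy sketch: visit nodes by decreasing rank, place each on the least-loaded PU."""
    ranks = longest_path_ranks(costs, succs)
    loads = [0] * num_pus
    mapping = {}
    # Ties in rank are broken by node name so the result is deterministic.
    for node in sorted(costs, key=lambda n: (-ranks[n], n)):
        pu = min(range(num_pus), key=lambda p: loads[p])
        mapping[node] = pu
        loads[pu] += costs[node]
    return mapping, loads


# Toy residual block (hypothetical costs): a main branch plus a skip branch.
costs = {"conv1": 4, "conv2a": 3, "conv2b": 3, "skip": 1, "add": 1}
succs = {
    "conv1": ["conv2a", "skip"],
    "conv2a": ["conv2b"],
    "conv2b": ["add"],
    "skip": ["add"],
}

mapping, loads = assign_nodes(costs, succs, num_pus=2)
print(mapping)
print(loads)
```

In a pipelined setting, the most heavily loaded PU bounds the achievable processing rate, which is why a balanced `loads` list is the quantity of interest in this sketch. Data dependencies and transfer costs between PUs are ignored here.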

Paper Structure

This paper contains 10 sections, 4 figures, 1 table, 2 algorithms.

Figures (4)

  • Figure 1: The IMCE Architecture
  • Figure 2: ResNet8: Normalized Processing Rate (a) and Latency (b) vs Number of PUs for various allocation methods
  • Figure 3: ResNet18: Normalized Processing Rate (a) and Latency (b) vs Number of PUs for various allocation methods
  • Figure 4: ResNet18: Normalized Processing Rate (a) and Latency (b) vs Number of PUs for different numbers of DPUs