CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures

Rebecca Pelke, Jose Cubero-Cascante, Nils Bosbach, Felix Staudigl, Rainer Leupers, Jan Moritz Joseph

TL;DR

CLSA-CIM tackles the data-movement bottleneck in computing-in-memory (CIM) architectures by introducing a cross-layer scheduling framework that operates on tiled RRAM CIM with weight-stationary data flow. The approach integrates with existing weight-duplication and intra-layer scheduling, and formalizes a four-stage cross-layer scheduling process (determine sets, determine dependencies, intra-layer scheduling, cross-layer scheduling) plus an option to combine with weight duplication. Case studies and simulations show substantial improvements in PE utilization and inference speed, with up to 29.2x speedups over state-of-the-art scheduling. The work provides a software-based method to exploit cross-layer opportunities in CIM, enabling significant performance gains for data-intensive ML workloads while highlighting practical limitations and directions for future retargetability.

Abstract

The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM performs computation within the memory unit, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. While conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it enhances the utilization of CIM cores, thereby accelerating computations. Although similar concepts are implicitly used in previous work, clear and quantifiable algorithmic definitions of cross-layer scheduling for tiled CIM architectures are lacking. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare its performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves utilization by up to 17.9x, resulting in an overall speedup of up to 29.2x compared to SOTA.

Paper Structure

This paper contains 21 sections, 3 equations, 7 figures, 2 tables, and 1 algorithm.

Figures (7)

  • Figure 1: NN inference on (a) tiled CIM architectures: (b) layer-by-layer scheduling, (c) weight duplication mapping, and (d) cross-layer scheduling
  • Figure 2: Partitioning, quantization (Q), and BN folding. The resulting canonical NN representation is split into base (green) and non-base (blue) layers
  • Figure 3: Conv2D to GEMM transformation using im2col
  • Figure 4: Implementation of weight duplication using three duplicates
  • Figure 5: Minimal example for CLSA-CIM using two consecutive Conv2D layers and a non-base layer path including bias, activation, pooling, and padding: Determine sets (a), determine dependencies (b), and cross-layer scheduling (c)
  • ...and 2 more figures
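
The Conv2D to GEMM transformation named in the Figure 3 caption can be illustrated with a small, generic im2col sketch. This is not the paper's implementation: the function names `im2col` and `conv2d_as_gemm` are hypothetical, and the example assumes a single input channel, a single kernel, no padding, and the cross-correlation convention common in ML frameworks (no kernel flip). It only shows how each sliding window becomes one row of a patch matrix, so that the convolution reduces to a matrix-vector product with the flattened kernel.

```python
def im2col(image, kh, kw, stride=1):
    """Unroll a single-channel 2D image into a patch matrix.

    Each returned row is one flattened kh x kw window, in row-major
    order of the output positions. Assumes no padding.
    """
    h, w = len(image), len(image[0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            patch = [image[i * stride + di][j * stride + dj]
                     for di in range(kh) for dj in range(kw)]
            rows.append(patch)
    return rows, (out_h, out_w)


def conv2d_as_gemm(image, kernel, stride=1):
    """Compute a 2D convolution as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    patches, (out_h, out_w) = im2col(image, kh, kw, stride)
    # The flattened kernel is the fixed operand of the product.
    flat_kernel = [v for row in kernel for v in row]
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in patches]
    # Reshape the flat result back into the 2D output feature map.
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_as_gemm(image, kernel)
```

Here the 3x3 input and 2x2 kernel yield four patches of length four, and `result` is the 2x2 output feature map. In a weight-stationary setting, the flattened kernel would correspond to the operand that stays fixed while patches stream in; how the paper maps this onto crossbars is not covered by this sketch.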