Pack my weights and run! Minimizing overheads for in-memory computing accelerators

Pouya Houshmand; Marian Verhelst

TL;DR

This paper proposes a novel algorithm for mapping weights onto in-memory computing (IMC) macros. It packs the weights of network layers efficiently into the available memory, minimizing weight loading time while maximally exploiting the parallelism of the IMC computational fabric.

Abstract

In-memory computing (IMC) hardware accelerators offer more than 10x improvements in peak efficiency and performance for matrix-vector multiplications (MVM) compared to conventional digital designs. For this reason, they have attracted great interest for accelerating neural network workloads. Nevertheless, these potential gains are achieved only when the utilization of the computational resources is maximized and the overhead of loading operands into the memory array is minimized. To this end, this paper proposes a novel mapping algorithm for the weights in the IMC macro, based on efficient packing of the weights of network layers in the available memory. The algorithm 1) minimizes weight loading times while 2) maximally exploiting the parallelism of the IMC computational fabric. A set of case studies is carried out to show the achievable trade-offs for the MLPerf Tiny benchmark \cite{mlperftiny} on IMC architectures, with potential $10$-$100\times$ energy-delay product (EDP) improvements.
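The packing idea in the abstract can be illustrated with a deliberately simple sketch. This is not the paper's algorithm; it swaps in a generic first-fit-decreasing heuristic and models each macro by a single scalar capacity. The `Macro` class, the function name, the layer names, and all sizes are hypothetical and chosen only for illustration. Layers that do not fit anywhere are treated as needing a reload at run time, which is the overhead the paper aims to minimize.

```python
from dataclasses import dataclass, field


@dataclass
class Macro:
    """Hypothetical IMC macro, modeled only by its weight capacity (in cells)."""
    capacity: int
    layers: list = field(default_factory=list)
    used: int = 0

    def fits(self, size: int) -> bool:
        return self.used + size <= self.capacity

    def add(self, name: str, size: int) -> None:
        self.layers.append(name)
        self.used += size


def pack_first_fit_decreasing(layer_sizes: dict, num_macros: int, capacity: int):
    """Place layers largest-first into the first macro that still has room.

    Returns (macros, spilled). Spilled layers fit in no macro and, in this
    toy model, would have to be reloaded for every inference.
    """
    macros = [Macro(capacity) for _ in range(num_macros)]
    spilled = []
    for name, size in sorted(layer_sizes.items(), key=lambda kv: -kv[1]):
        target = next((m for m in macros if m.fits(size)), None)
        if target is None:
            spilled.append(name)
        else:
            target.add(name, size)
    return macros, spilled


# Made-up layer footprints (number of weight cells), not taken from the paper.
layers = {"conv1": 400, "conv2": 900, "conv3": 600, "fc": 300}
macros, spilled = pack_first_fit_decreasing(layers, num_macros=2, capacity=1024)
for i, m in enumerate(macros):
    print(f"macro {i}: {m.layers} ({m.used}/{m.capacity} cells)")
print("reloaded every inference:", spilled)
```

A scalar capacity ignores the shape of the array, so this sketch cannot capture the second goal of the abstract, exploiting the parallelism of the computational fabric; it only shows why tighter packing leaves fewer weights to reload.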

Paper Structure (15 sections, 1 equation, 9 figures, 1 table)

Introduction
Background and overview
Dataflow concepts for IMC
Motivation
Weight loading overhead
Underutilization of available computational parallelism
Weight Packing Algorithm
Tile generation
SuperTile Generation
Column generation
Column allocation to macros
Case studies
Weight mapping methods comparison
Impact of weight loading and $D_h$
Conclusion

Figures (9)

Figure 1: Weight reloading is a major energy and latency overhead in IMC computation for DNN workloads; this work aims to minimize its impact and maximize stationarity by efficiently packing the weights in the IMC array.
Figure 2: a) IMC template and its 4D design space. b) DNN operations and the weight-stationary mapping of the weights in IMC.
Figure 3: SRAM density increases proportionally with $D_m$; the contribution of multipliers and peripherals is amortized as we increase the number of cells per multiplier. This is adopted for both digital (D-IMC) and analog (A-IMC) designs.
Figure 4: Weight tile pool generation steps
Figure 5: Steps following the weight tile pool generation: a) supertile pool generation, b) column pool generation, and finally c) the allocation of columns to IMC macros.
...and 4 more figures
