Pruning for Improved ADC Efficiency in Crossbar-based Analog In-memory Accelerators

Timur Ibrayev; Isha Garg; Indranil Chakraborty; Kaushik Roy

TL;DR

This paper motivates crossbar-attuned pruning that targets ADC-specific inefficiencies. It identifies three key sparsity properties that can be exploited to reduce ADC energy without sacrificing accuracy.

Abstract

Deep learning has proved successful in many applications but suffers from high computational demands and requires custom accelerators for deployment. Crossbar-based analog in-memory architectures are attractive for accelerating deep neural networks (DNNs) because combining storage and computation in memory gives them high data reuse and high efficiency. However, they require analog-to-digital converters (ADCs) to communicate crossbar outputs. ADCs consume a significant portion of the energy and area of every crossbar processing unit, thus diminishing the potential efficiency benefits. Pruning is a well-studied technique for improving the efficiency of DNNs but requires modifications to be effective for crossbars. In this paper, we motivate crossbar-attuned pruning to target ADC-specific inefficiencies. This is achieved by identifying three key properties (dubbed D.U.B.) of sparsity that can be utilized to reduce ADC energy without sacrificing accuracy. The first property ensures that sparsity translates effectively to hardware efficiency by restricting sparsity levels to Discrete powers of 2. The other two properties encourage columns in the same crossbar to achieve both Unstructured and Balanced sparsity in order to amortize the accuracy drop. The desired D.U.B. sparsity is then achieved by regularizing the variance of the $L_{0}$ norms of neighboring columns within the same crossbar. Our proposed implementation allows this regularization to be used directly in end-to-end gradient-based training. We apply the proposed algorithm to the convolutional layers of VGG11 and ResNet18 models, trained on the CIFAR-10 and ImageNet datasets, and achieve up to 7.13x and 1.27x improvement, respectively, in ADC energy with less than 1% drop in accuracy.
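As a rough illustration of the balance criterion described in the abstract, the sketch below computes the variance of per-column $L_{0}$ norms for a single crossbar tile using NumPy. The function names and the toy $4\times4$ tiles are hypothetical. The hard nonzero count used here is not differentiable, so this only illustrates the quantity being regularized and does not reproduce the paper's training-time formulation.

```python
import numpy as np

def column_l0(tile):
    """Count the nonzero weights in each column of a (rows x cols) tile."""
    return np.count_nonzero(tile, axis=0)

def l0_variance(tile):
    """Variance of per-column L0 norms (hypothetical helper).

    The value is zero when every column in the tile has the same
    number of nonzero weights, i.e. the tile is perfectly balanced.
    """
    return float(np.var(column_l0(tile)))

# Toy tile in which every column holds exactly two nonzero weights.
balanced = np.array([
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 0, 0, 0],
    [4, 1, 1, 2],
])

# Toy tile whose columns hold 4, 0, 2 and 2 nonzero weights.
unbalanced = np.array([
    [1, 0, 2, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 2],
])

print(column_l0(balanced), l0_variance(balanced))
print(column_l0(unbalanced), l0_variance(unbalanced))
```

Both toy tiles contain eight nonzero weights in total, yet only the unbalanced one has nonzero variance, which is the signal a balance-encouraging penalty would act on.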

Paper Structure (26 sections, 5 equations, 5 figures, 1 table)

Introduction
Background and Related Works
Crossbar-based analog in-memory processing
Pruning methods
Neural network sparsity
Pruning for crossbars
DUB sparsity for ADC efficiency
Discretized (D) sparsity
Unstructured (U) sparsity
Balanced (B) sparsity
Methodology
Algorithm Overview
Training for intra-tile U and B sparsity
Achieving U and B sparsity
Challenge of estimating training-time sparsity
...and 11 more sections

Figures (5)

Figure 1: Intuition behind the discretized, unstructured, and balanced (D.U.B.) properties, illustrated through column-wise weight distributions and sparsity patterns of an $8\times4$ crossbar that by default requires an ADC with a precision of $N$ bits. Here, LSC denotes the sparsity (number of zeros) of the least sparse column.
Figure 2: (a) Example illustrating how a convolutional layer of a DNN is implemented as a crossbar-based matrix-vector multiplication. (b) The logical crossbar (tile) structure that is usually used in machine-learning accelerators.
Figure 3: Example of training an $8\times4$ tile for U and B sparsity. Note how the gradients due to variance, $gVar_c = [g_{c1},g_{c2},g_{c3},g_{c4}]$, are zeroed out for columns 2 and 4 after passing through the gradient gate $\partial G(\cdot)/\partial H_s$, leaving a balance-regulating force only on columns 1 and 3, which have $L_{0}$ greater than $\mu^t$.
Figure 4: Distribution of tiles by their least sparse columns, obtained by different pruning methods applied to (a) a VGG11 network trained on CIFAR-10 and (b) a ResNet18 network trained on ImageNet.
Figure 5: (a) Normalized ADC savings and (b) fraction of entirely removed tiles for different tile sizes, obtained by different pruning methods on a VGG11 network trained on the CIFAR-10 dataset.
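The least-sparse-column (LSC) notion from the Figure 1 caption, together with the power-of-2 discretization mentioned in the abstract, can be sketched as follows. This is a minimal illustration under a stated assumption, not the paper's energy model: it assumes that each halving of the maximum number of nonzero weights in any column of a tile allows the ADC precision to drop by one bit. The helper names and the toy $8\times4$ tile are hypothetical.

```python
import numpy as np

def least_sparse_column_zeros(tile):
    """Number of zeros in the least sparse column (LSC) of a tile."""
    zeros_per_column = tile.shape[0] - np.count_nonzero(tile, axis=0)
    return int(zeros_per_column.min())

def adc_bits_saved(tile):
    """ADC bits saved for a tile, under the ASSUMED one-bit-per-halving rule.

    Returns None when the tile holds no nonzero weights at all, in which
    case the whole tile could be skipped instead.
    """
    rows = tile.shape[0]
    max_nonzeros = rows - least_sparse_column_zeros(tile)
    if max_nonzeros == 0:
        return None
    bits = 0
    # Count how many times the worst-case column fits into the row count
    # after repeated doubling; integer arithmetic avoids float log2.
    while max_nonzeros * 2 <= rows:
        max_nonzeros *= 2
        bits += 1
    return bits

# Toy 8x4 tile whose columns have 4, 3, 2 and 4 nonzero weights.
tile = np.zeros((8, 4))
tile[:4, 0] = 1.0
tile[:3, 1] = 1.0
tile[:2, 2] = 1.0
tile[:4, 3] = 1.0

print(least_sparse_column_zeros(tile))  # zeros in the least sparse column
print(adc_bits_saved(tile))             # bits saved under the assumed rule
```

Under this assumption, the saving is set by the least sparse column alone, so extra zeros in the other columns bring no additional ADC benefit unless every column crosses the next power-of-2 threshold.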
