Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference

Rishav Mukherji; Mark Schöne; Khaleelulla Khan Nazeer; Christian Mayr; Anand Subramoney

TL;DR

This work shows that activity sparsity composes multiplicatively with parameter sparsity in a GRU-based recurrent neural network designed to be activity sparse. The results provide strong evidence that making deep learning models activity sparse and porting them to neuromorphic devices is a viable strategy that does not compromise task performance.

Abstract

Artificial neural networks open up unprecedented machine learning capabilities at the cost of ever-growing computational requirements. Sparsifying the parameters, often achieved through weight pruning, has been identified as a powerful technique to compress the number of model parameters and reduce the computational operations of neural networks. Yet, sparse activations, while omnipresent in both biological neural networks and deep learning systems, have not been fully utilized as a compression technique in deep learning. Moreover, the interaction between sparse activations and weight pruning is not fully understood. In this work, we demonstrate that activity sparsity can compose multiplicatively with parameter sparsity in a recurrent neural network model based on the GRU that is designed to be activity sparse. We achieve up to $20\times$ reduction of computation while maintaining perplexities below $60$ on the Penn Treebank language modeling task. This magnitude of reduction has not been achieved previously with sparsely connected LSTMs alone, and the language modeling performance of our model has not been achieved previously with any sparsely activated recurrent neural networks or spiking neural networks. Neuromorphic computing devices are especially good at taking advantage of dynamic activity sparsity, and our results provide strong evidence that making deep learning models activity sparse and porting them to neuromorphic devices can be a viable strategy that does not compromise on task performance. Our results also drive further convergence of methods from deep learning and neuromorphic computing for efficient machine learning.
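The multiplicative composition claimed in the abstract can be illustrated with a small counting sketch. It assumes an event-driven matrix-vector product in which a multiply-accumulate (MAC) is performed only when the weight survived pruning and the input unit is non-zero. The function name `count_macs`, the layer size, and the density values are hypothetical and are not taken from the paper.

```python
import numpy as np

def count_macs(weight_mask, active):
    """Count MACs of an event-driven mat-vec.

    weight_mask: (n_out, n_in) boolean, True where a weight survived pruning.
    active:      (n_in,) boolean, True where the input unit is non-zero.
    Only columns belonging to active inputs are visited, and within those
    columns only the unpruned weights cost a MAC.
    """
    return int(weight_mask[:, active].sum())

rng = np.random.default_rng(0)
n_out, n_in = 512, 512
weight_density = 0.2  # hypothetical: 80% of the weights pruned
activity = 0.25       # hypothetical: 25% of the units non-zero at this step

weight_mask = rng.random((n_out, n_in)) < weight_density
active = rng.random(n_in) < activity

dense_macs = n_out * n_in
sparse_macs = count_macs(weight_mask, active)

# If the two sparsity patterns are independent, the surviving fraction of
# MACs is approximately the product of the two densities.
expected_fraction = weight_density * activity
measured_fraction = sparse_macs / dense_macs
print(f"expected {expected_fraction:.3f}, measured {measured_fraction:.3f}")
```

Under these assumed densities the expected fraction is 0.05, which corresponds to a reduction of roughly 20 times relative to the dense product. The sketch counts operations only and says nothing about task performance.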

Paper Structure (14 sections, 3 equations, 5 figures, 3 tables)

Introduction
Related Work
Efficient Recurrent Neural Networks for Neuromorphic Accelerators
Event-based Gated Recurrent Unit
Sparsely Connected Networks
Efficiency of Sparse Activations and Sparse Connectivity
Results
Weight Sparsity
Activity Sparsity
Discussion
Extended Results
Pruning Methodology
Network Activity
Limitations

Figures (5)

Figure 1: Influence of weight sparsity and activity sparsity on the Penn Treebank and WikiText-2 datasets. A and B: Test perplexity versus reduction of MAC operations through weight sparsity (LSTM) and combined activity and weight sparsity (EGRU). We plot the mean test perplexity and corresponding standard deviation over 15 seeds. C and D: Activity sparsity vs weight sparsity trade-off for EGRU. The marker size is proportional to the number of MAC operations while the colour represents task performance in terms of test perplexity.
Figure 2: Effect of weight decay regularization on the performance and activity sparsity of the EGRU. We apply separate degrees of weight decay to the weights and to the biases. All models are trained on WikiText-2. A: Validation perplexity trade-off with weight decay. B: Weight decay on both the weights and the bias reduces the amount of sparse activations. C and D: Effect of weight decay on the distribution of weights and biases for fixed weight decay on the bias of 0.01.
Figure 3: Influence of the number of pruning steps on the results for different target weight sparsities. Results are displayed for EGRU experiments on the Penn Treebank dataset. Other experimental setups follow similar trends.
Figure 4: Histogram of activity of the EGRU neurons in each layer. For example, an activity of 20% denotes that a neuron's output is non-zero 20% of the time, hence saving operations 80% of the time.
Figure 5: Histogram of cell state values shifted by the thresholds $\mathbf{c} - \boldsymbol{\vartheta}$ in each layer.
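The quantities in Figures 4 and 5 can be related with a short sketch. It assumes that a unit emits its cell state only when that state exceeds the unit's threshold, which the shifted quantity $\mathbf{c} - \boldsymbol{\vartheta}$ in the Figure 5 caption suggests but which this summary does not spell out. The function names and the toy values are hypothetical.

```python
import numpy as np

def thresholded_output(c, theta):
    """Assumed gating: pass c through where c - theta > 0, output 0 elsewhere."""
    return np.where(c - theta > 0, c, 0.0)

def per_neuron_activity(outputs):
    """Fraction of time steps at which each unit's output is non-zero.

    outputs: (time, units) array. An activity of 0.2 means the unit is
    non-zero 20% of the time, as in the Figure 4 caption.
    """
    return (outputs != 0).mean(axis=0)

# Toy cell states for 4 time steps and 2 units (illustrative values only).
c = np.array([[0.10, 0.90],
              [0.60, 0.20],
              [0.70, 0.95],
              [0.00, 0.30]])
theta = np.array([0.50, 0.92])  # hypothetical per-unit thresholds

y = thresholded_output(c, theta)
act = per_neuron_activity(y)
print(act)  # [0.5  0.25]
```

In this toy case the first unit is active at two of four steps and the second at one of four, so operations driven by their outputs would be skipped 50% and 75% of the time respectively.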
