Weight Sparsity Complements Activity Sparsity in Neuromorphic Language Models

Rishav Mukherji; Mark Schöne; Khaleelulla Khan Nazeer; Christian Mayr; David Kappel; Anand Subramoney

TL;DR

This work investigates the joint impact of activity sparsity and connectivity sparsity on language modeling using an event-based GRU (EGRU). By comparing sparsely activated EGRU networks against densely activated LSTM baselines on Penn Treebank and WikiText-2, the study shows that activity sparsity and weight pruning interact largely independently: their savings multiply, so the number of multiply-accumulate (MAC) operations scales with the product $\lambda_a \lambda_w$, with minimal performance penalties across a broad sparsity range. The authors also show that weight decay can tune network activity, offering a practical lever to meet hardware constraints while preserving accuracy. These findings support the viability of sparsely connected, event-based sequence models for energy-efficient neuromorphic hardware and provide insight into tuning sparsity for practical deployments.
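The multiplicative saving can be checked with a small counting experiment. The sketch below is illustrative and not taken from the paper: it assumes that $\lambda_a$ denotes the fraction of active (non-zero) units and $\lambda_w$ the fraction of non-zero weights, that the two sparsity patterns are drawn independently at random, and that hardware skips any multiplication with a zero operand. The function name `count_macs` and all sizes are hypothetical.

```python
import numpy as np


def count_macs(weights: np.ndarray, activity: np.ndarray) -> int:
    """Count the MACs needed for weights @ activity when every product
    with a zero weight or a zero input is skipped (assumed hardware model)."""
    active_cols = activity != 0  # only units that emitted an event
    return int(np.count_nonzero(weights[:, active_cols]))


rng = np.random.default_rng(0)
n = 1000          # hypothetical layer width
lambda_w = 0.2    # assumed fraction of weights kept after pruning
lambda_a = 0.1    # assumed fraction of units active in one step

# Independent random masks stand in for pruned weights and sparse events.
weights = rng.standard_normal((n, n)) * (rng.random((n, n)) < lambda_w)
activity = rng.standard_normal(n) * (rng.random(n) < lambda_a)

dense_macs = n * n
sparse_macs = count_macs(weights, activity)

# Empirical sparsity levels of this particular draw.
frac_w = np.count_nonzero(weights) / weights.size
frac_a = np.count_nonzero(activity) / activity.size

measured = sparse_macs / dense_macs
expected = frac_w * frac_a
print(f"measured MAC fraction: {measured:.4f}")
print(f"product of fractions:  {expected:.4f}")
```

Under these assumptions the measured MAC fraction closely tracks the product of the two fractions. The agreement relies on the two masks being independent; if pruning systematically removed weights attached to the most active units, the two numbers would diverge.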

Abstract

Activity and parameter sparsity are two standard methods of making neural networks computationally more efficient. Event-based architectures such as spiking neural networks (SNNs) naturally exhibit activity sparsity, and many methods exist to sparsify their connectivity by pruning weights. While the effect of weight pruning on feed-forward SNNs has been previously studied for computer vision tasks, the effects of pruning for complex sequence tasks like language modeling are less well studied since SNNs have traditionally struggled to achieve meaningful performance on these tasks. Using a recently published SNN-like architecture that works well on small-scale language modeling, we study the effects of weight pruning when combined with activity sparsity. Specifically, we study the trade-off between the multiplicative efficiency gains the combination affords and its effect on task performance for language modeling. To dissect the effects of the two sparsities, we conduct a comparative analysis between densely activated models and sparsely activated event-based models across varying degrees of connectivity sparsity. We demonstrate that sparse activity and sparse connectivity complement each other without a proportional drop in task performance for an event-based neural network trained on the Penn Treebank and WikiText-2 language modeling datasets. Our results suggest sparsely connected event-based neural networks are promising candidates for effective and efficient sequence modeling.

Paper Structure (10 sections, 3 equations, 5 figures, 4 tables)

Introduction
Related Work
Methods
Event-based Gated Recurrent Unit
Sparsely Connected Networks
Efficiency of Sparse Activations and Sparse Connectivity
Results
Joint activity sparsity and connectivity sparsity
Activity sparsity and weight regularization
Discussion

Figures (5)

Figure 1: Sparsely connected artificial neural networks (ANNs) such as LSTMs [Hochreiter1997] transfer their entire state for all simulation steps, while event-based neural networks such as EGRU [Subramoney2023] only transfer a fraction of their state in each step.
Figure 2: Weight sparsity (pruned LSTM) vs. joint activity sparsity and weight sparsity (pruned EGRU). Each point corresponds to either an LSTM or an EGRU, with increasingly sparse connections from left to right. Both models show a similar performance degradation as connections are removed. Mean test perplexity and corresponding standard deviation over 15 random seeds. The detailed numbers for the best models are presented in tab. \ref{tab:ptb-results} and tab. \ref{tab:wikitext-results}.
Figure 3: EGRU adjusts its network activity through gradient descent, which leads to different degrees of activity sparsity for corresponding degrees of connectivity sparsity. Across a broad range of connectivity sparsity, the network activity remained almost independent of it. Training compensated for connection sparsity with more network activity only when the connection sparsity was high.
Figure 4: Effect of weight decay regularization on the performance and activity sparsity of the EGRU. We considered multiple degrees of weight decay for the weights and biases separately. Means and errors are plotted for a fixed decay rate on the weights and varying decay rates on the biases. All models were trained on the larger WikiText-2 dataset [merity2017pointer].
Figure 5: Effect of varying degrees of weight decay applied to the weights on the distribution of weights and biases, for a fixed weight decay of 0.01 applied to the biases. All models were trained on the larger WikiText-2 dataset [merity2017pointer].
