Investigating Sparsity in Recurrent Neural Networks

Harshil Darji

TL;DR

The work analyzes sparsity in Recurrent Neural Networks through two complementary approaches: pruning existing connections and embedding sparsity via randomly generated graphs. It systematically evaluates four RNN variants (RNN-Tanh, RNN-ReLU, LSTM, GRU) on Reber grammar data, quantifying how much pruning can be tolerated and how quickly accuracy can be recovered. The study then constructs Sparse-RNNs from Watts–Strogatz and Barabási–Albert graphs, examining how graph properties correlate with performance and whether these features can predict outcomes with regression models. Key findings show that substantial pruning (often >60%) can be tolerated with fast recovery (often after a single retraining epoch), while randomly structured networks reveal meaningful links between topology and accuracy; GRU in particular yields strong predictability of performance from graph features, supporting graph-guided neural architecture search. Overall, the results substantiate sparsity as a viable route to efficient RNNs and offer practical insights for sparsity strategies and NAS-inspired design.

Abstract

In the past few years, neural networks have evolved from simple Feedforward Neural Networks to more complex architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Whereas CNNs suit tasks in which sequence order is not important, such as image recognition, RNNs are useful when order matters, as in machine translation. Increasing the number of layers in a neural network is one way to improve its performance, but it also increases the network's complexity, making it much more time- and power-consuming to train. One way to tackle this problem is to introduce sparsity into the architecture of the neural network. Pruning is one of many methods for making a neural network architecture sparse: it removes weights below a certain threshold while keeping performance close to that of the original network. Another way is to generate arbitrary structures using random graphs and embed them between the input and output layers of an Artificial Neural Network. In past years, many researchers have focused on pruning CNNs, while hardly any research has addressed pruning RNNs. The same holds for creating sparse RNN architectures by generating and embedding arbitrary structures. Therefore, this thesis investigates the effects of these two techniques on the performance of RNNs. We first describe the pruning of RNNs, its impact on their performance, and the number of training epochs required to regain accuracy after pruning. Next, we describe the creation and training of Sparse Recurrent Neural Networks and identify the relation between their performance and the graph properties of the underlying arbitrary structure. We perform these experiments on an RNN with Tanh nonlinearity (RNN-Tanh), an RNN with ReLU nonlinearity (RNN-ReLU), GRU, and LSTM. Finally, we analyze and discuss the results of both experiments.
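To make the pruning idea in the abstract concrete, the following is a minimal sketch of magnitude-based pruning on a single weight matrix. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not say whether the threshold is fixed or derived from a target percentage, or whether it is applied per layer or globally. The function names `magnitude_prune` and `sparsity` are hypothetical, and the sketch picks the threshold from a target fraction of weights to remove.

```python
def magnitude_prune(weights, fraction):
    """Zero out roughly `fraction` of the entries with the smallest magnitude.

    `weights` is a list of rows (a plain 2-D list of floats).
    Returns (pruned, mask); mask[i][j] is 1 where the weight was kept.
    Assumption: the threshold is chosen per matrix from a target fraction.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")

    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(fraction * len(magnitudes))
    if n_prune == 0:
        # Nothing to remove: keep every weight.
        mask = [[1 for _ in row] for row in weights]
        return [list(row) for row in weights], mask

    # Largest magnitude that still gets pruned. Entries tied with this
    # value are pruned as well, so ties can remove slightly more weights.
    threshold = magnitudes[n_prune - 1]
    mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [
        [w if keep else 0.0 for w, keep in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]
    return pruned, mask


def sparsity(mask):
    """Fraction of entries in `mask` that are zero (i.e. pruned)."""
    total = sum(len(row) for row in mask)
    zeros = sum(1 for row in mask for keep in row if not keep)
    return zeros / total if total else 0.0


if __name__ == "__main__":
    example = [[0.5, -0.1, 0.3], [-0.05, 0.8, -0.2]]
    pruned, mask = magnitude_prune(example, 0.5)
    print(pruned, sparsity(mask))
```

When retraining after pruning, the returned mask would typically be reapplied after each weight update so that removed connections stay at zero; that step is omitted here.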

Paper Structure (67 sections, 40 equations, 74 figures, 11 tables, 2 algorithms)

Figures (74)

  • Figure 1: Hand-written digits: Each hand-written digit is a $28\times28$ pixel grayscale image. The entire MNIST dataset contains a total of $70000$ such images, split into a training set of $60000$ images and a test set of $10000$ images.
  • Figure 3: Single Layer Perceptron with binary inputs $x_1, x_2, x_3, \dots, x_n$ and their corresponding weights $w_1, w_2, w_3, \dots, w_n$.
  • Figure 4: Line plot of ReLU activation function
  • Figure 5: Line plot of Sigmoid activation function
  • Figure 6: Line plot of Tanh activation function
  • ...and 69 more figures