Deep Rewiring: Training very sparse deep networks

Guillaume Bellec, David Kappel, Wolfgang Maass, Robert Legenstein

TL;DR

Deep Rewiring addresses training deep networks under strict connectivity limits by jointly learning weights and sparse architectures via stochastic rewiring. It frames rewiring as sampling from a tempered posterior over both parameters and connectivity, enforcing a hard bound on active connections and enabling online adaptation to task demands. Empirically, DEEP R achieves competitive performance at very high sparsity across feedforward, convolutional, and recurrent models, often outperforming pruning-based methods at similar sparsities and enabling transfer-like learning. Theoretical results establish convergence to a stationary constrained posterior, providing a rigorous foundation for sparse online learning with potential hardware benefits. Overall, DEEP R offers a principled, brain-inspired approach to efficient, on-chip training and deployment of sparse deep networks.

Abstract

Neuromorphic hardware tends to limit the connectivity of the deep networks that can run on it. Generic hardware and software implementations of deep learning also run more efficiently on sparse networks. Several methods exist for pruning the connections of a neural network after it has been trained without connectivity constraints. We present an algorithm, DEEP R, that enables us to directly train a sparsely connected neural network. DEEP R automatically rewires the network during supervised training so that connections are placed where they are most needed for the task, while their total number remains strictly bounded at all times. We demonstrate that DEEP R can be used to train very sparse feedforward and recurrent neural networks on standard benchmark tasks with only a minor loss in performance. DEEP R is based on a rigorous theoretical foundation that views rewiring as stochastic sampling of network configurations from a posterior.
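The abstract describes rewiring under a strict connection budget but does not spell out the update. The following is a minimal sketch of what one such step might look like, not the authors' implementation. It assumes that each connection carries a non-negative magnitude, that an active connection becomes dormant when a noisy, $\ell_1$-regularized gradient step drives its magnitude below zero, and that randomly chosen dormant connections are then activated at zero strength to restore the budget. The function name `deep_r_step` and the parameters `lr`, `l1`, and `temperature` are hypothetical.

```python
import math
import random


def deep_r_step(theta, active, grads, lr=0.05, l1=1e-4, temperature=0.0, rng=random):
    """One hypothetical rewiring step over a flat list of connections.

    theta[k]  : non-negative magnitude of connection k (its sign is assumed fixed elsewhere)
    active[k] : True if connection k currently exists
    grads[k]  : gradient of the loss with respect to theta[k] for the current mini-batch
    """
    n_target = sum(active)  # assumed budget: the number of active connections never changes
    noise_scale = math.sqrt(2.0 * lr * temperature)

    for k in range(len(theta)):
        if not active[k]:
            continue  # dormant connections receive no update
        # Gradient step, constant l1 shrinkage, and Gaussian exploration noise.
        theta[k] += -lr * grads[k] - lr * l1 + noise_scale * rng.gauss(0.0, 1.0)
        if theta[k] < 0.0:
            # The magnitude crossed zero, so the connection is retracted.
            active[k] = False
            theta[k] = 0.0

    # Reactivate uniformly chosen dormant connections until the budget is met again.
    dormant = [k for k, is_active in enumerate(active) if not is_active]
    n_missing = n_target - sum(active)
    for k in rng.sample(dormant, n_missing):
        active[k] = True
        theta[k] = 0.0  # new connections start at zero strength

    return theta, active
```

Calling `deep_r_step` repeatedly with fresh mini-batch gradients keeps `sum(active)` constant by construction, which is the property the abstract refers to as a strictly bounded number of connections. A connection that was just retracted can be drawn again in the same step, because the draw is uniform over all dormant connections.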

Paper Structure

This paper contains 18 sections, 3 theorems, 47 equations, 6 figures, 3 algorithms.

Key Result

Theorem 1

Let $p^{*} (\boldsymbol{\theta} \,|\, \mathbf{X}, \mathbf{Y}^*)$ be a strictly positive, continuous probability distribution over parameters $\boldsymbol{\theta}$, twice continuously differentiable with respect to $\boldsymbol{\theta}$, and let $\beta>0$. Then the set of stochastic differential equa

Figures (6)

  • Figure 1: Visual pattern recognition with sparse networks during training. Sample training images (top), test classification accuracy after training for various connectivity levels (middle) and example test accuracy evolution during training (bottom) for a standard feed forward network trained on MNIST (A) and a CNN trained on CIFAR-10 (B). Accuracies are shown for various algorithms. Green: DEEP R; red: soft-DEEP R; blue: SGD with initially fixed sparse connectivity; dashed gray: SGD, fully connected. Since soft-DEEP R does not guarantee a strict upper bound on the connectivity, accuracies are plotted against the highest connectivity ever met during training (middle panels). Iteration number refers to the number of parameter updates during training.
  • Figure 2: Rewiring in recurrent neural networks. Network performance for one example run (A) and at various connectivity levels (B), as in Fig. 1, for an LSTM network trained on the TIMIT dataset with DEEP R (green), soft-DEEP R (red) and a network with fixed random connectivity (blue). Dotted line: fully connected LSTM trained without regularization as reported in Greff et al. (2017). Thick dotted line: fully connected LSTM with $\ell_2$ regularization.
  • Figure 3: Efficient network solutions under strict sparsity constraints. Accuracy and connectivity obtained by DEEP R and soft-DEEP R in comparison to those achieved by pruning (Han et al., 2015) and $\ell_1$-shrinkage (Tibshirani, 1996; Collins and Kohli, 2014). A, B) Accuracy against the connectivity for MNIST (A) and CIFAR-10 (B). For each algorithm, one network with a decent compromise between accuracy and sparsity is chosen (small gray boxes) and its connectivity across training iterations is shown below. C) Performance on the TIMIT dataset. D) Phoneme error rates and connectivities across iteration number for representative training sessions.
  • Figure 4: Transfer learning with DEEP R. The target labels of the MNIST data set were shuffled after every epoch. A) Network accuracy vs. training epoch. The increase of network performance across tasks (epochs) indicates a transfer of knowledge between tasks. B) Correlation between weight matrices of subsequent epochs for each network layer. C) Correlation between neural activity vectors of subsequent epochs for each network layer. The transfer is most visible in the first hidden layer, since weights and outputs of this layer are correlated across tasks. Shaded areas in B) and C) represent standard deviation across 5 random seeds, influencing network initialization, noisy parameter updates, and shuffling of the outputs.
  • Figure 5: Hyper-parameter search for the pruning algorithm of Han et al. (2015). Each point of the grid represents a pair of weight decay coefficient and quality factor. The number and the color indicate the performance in terms of accuracy (left) or connectivity (right). The red rectangle indicates the data points that were used in Fig. 3A.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • proof
  • Theorem 2
  • proof