Gradient-Free Training of Recurrent Neural Networks using Random Perturbations

Jesus Garcia Fernandez; Sander Keemink; Marcel van Gerven

TL;DR

The recently introduced activity-based node perturbation (ANP) method is extended to operate in the time domain, leading to more efficient learning and generalization. The results suggest that perturbation-based learning methods offer a versatile alternative to gradient-based methods for training RNNs and can be ideally suited for neuromorphic computing applications.

Abstract

Recurrent neural networks (RNNs) hold immense potential for computations due to their Turing completeness and sequential processing capabilities, yet existing methods for their training encounter efficiency challenges. Backpropagation through time (BPTT), the prevailing method, extends the backpropagation (BP) algorithm by unrolling the RNN over time. However, this approach suffers from significant drawbacks, including the need to interleave forward and backward phases and store exact gradient information. Furthermore, BPTT has been shown to struggle to propagate gradient information for long sequences, leading to vanishing gradients. An alternative strategy to using gradient-based methods like BPTT involves stochastically approximating gradients through perturbation-based methods. This learning approach is exceptionally simple, necessitating only forward passes in the network and a global reinforcement signal as feedback. Despite its simplicity, the random nature of its updates typically leads to inefficient optimization, limiting its effectiveness in training neural networks. In this study, we present a new approach to perturbation-based learning in RNNs whose performance is competitive with BPTT, while maintaining the inherent advantages over gradient-based learning. To this end, we extend the recently introduced activity-based node perturbation (ANP) method to operate in the time domain, leading to more efficient learning and generalization. We subsequently conduct a range of experiments to validate our approach. Our results show similar performance, convergence time and scalability compared to BPTT, strongly outperforming standard node and weight perturbation methods. These findings suggest that perturbation-based learning methods offer a versatile alternative to gradient-based methods for training RNNs which can be ideally suited for neuromorphic computing applications.
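
To make the idea of learning from forward passes and a global scalar signal concrete, the following is a minimal sketch of generic weight perturbation on a toy scalar RNN. It is not the ANP method of the paper. The network, the teacher parameters, the sine input, and all names (`rnn_outputs`, `mse`, `perturbation_step`, `train`) and hyperparameters (`sigma`, `lr`) are assumptions chosen for illustration only.

```python
import math
import random


def rnn_outputs(params, inputs):
    """Toy scalar RNN (assumed): x_t = tanh(w_in*u_t + w_rec*x_{t-1}), y_t = w_out*x_t."""
    w_in, w_rec, w_out = params
    x, outputs = 0.0, []
    for u in inputs:
        x = math.tanh(w_in * u + w_rec * x)
        outputs.append(w_out * x)
    return outputs


def mse(params, inputs, targets):
    """Global scalar loss over the sequence: the only feedback the rule uses."""
    outputs = rnn_outputs(params, inputs)
    return sum((y - t) ** 2 for y, t in zip(outputs, targets)) / len(targets)


def perturbation_step(params, inputs, targets, rng, sigma=1e-3, lr=0.01):
    """One weight-perturbation update from a clean and a noisy forward pass."""
    noise = [rng.gauss(0.0, sigma) for _ in params]
    clean = mse(params, inputs, targets)
    noisy = mse([p + n for p, n in zip(params, noise)], inputs, targets)
    # (noisy - clean) * noise / sigma^2 is a stochastic estimate of the gradient.
    scale = lr * (noisy - clean) / sigma ** 2
    return [p - scale * n for p, n in zip(params, noise)], clean


def train(steps=2000, seed=0):
    """Fit a student RNN to a hypothetical teacher; return (initial, final) loss."""
    rng = random.Random(seed)
    inputs = [math.sin(0.3 * t) for t in range(20)]
    targets = rnn_outputs([1.0, 0.5, 0.8], inputs)  # hypothetical teacher weights
    params = [0.5, 0.0, 0.5]                        # hypothetical student init
    first = mse(params, inputs, targets)
    for _step in range(steps):
        params, _loss = perturbation_step(params, inputs, targets, rng)
    return first, mse(params, inputs, targets)
```

Calling `train()` returns the loss before and after training. No backward pass is run and no gradients are stored; each update needs only two forward passes and the difference between their losses, which is why the two passes can in principle be computed in parallel.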

Paper Structure (20 sections, 13 equations, 14 figures, 2 tables)

Introduction
Methods
Recurrent neural network model
Node perturbation through time
Activity-based node perturbation through time
Weight perturbation through time
Decorrelation of neural inputs
Experimental validation
Results
Mackey-Glass time series task
Copying memory task
Weather prediction task
Scaling performance
Decorrelation results
Discussion
...and 5 more sections

Figures (14)

Figure 1: Gradient-based vs perturbation-based learning. Example depicts networks unrolled across 3 time steps. a) General procedure followed by gradient-based learning approaches. Sequential computation of the forward and backward passes is necessary to calculate updates. b) General procedure utilized by perturbation-based learning approaches. The computation of the eligibility trace varies based on the employed algorithm (e.g., NP, WP, ANP). In perturbation-based learning, the forward pass and noisy forward pass can be parallelized by employing two models.
Figure 2: Recurrent neural network model. Recurrent units are interconnected and self-connected. Vectors $u_t$, $x_t$ and $y_t$ denote the input, recurrent and output layer activations, respectively.
Figure 3: RNN with decorrelation scheme. In this setup, we include an extra matrix, $D$, and an intermediate state that transforms the correlated neural input $x_t$ into the uncorrelated neural input $x^*_t$. The recurrent connection $R$ is placed after the decorrelated state $x^*_t$, feeding an input to $x_{t+1}$ (in the next time step). The recurrent connection $R$ is fully connected. $x_t$ is the only variable that includes non-linearities; $u_t$, $x^*_t$ and $y_t$ are linear. The variable $x^*_t$ is used to map the recurrent states to the outputs.
Figure 4: Mackey-Glass data and results. a) 500 time steps of a synthetically generated Mackey-Glass time series along with the predictions of a BP-trained model before and after training. b) Performance during training over the train and test set for the different methods, represented in a logarithmic scale. c) Final performance for the different methods, computed as the mean performance over the last 50 epochs.
Figure 5: Copying memory data and results. a) At the top, we depict an example of an input with annotations. The sequence length is 20 and the delay period is 10. At the bottom, we show the predictions of a BP-trained model before and after training. b) Performance during training over the train and test set for the different methods. c) Final performance for the different methods, computed as the mean performance over the last 50 epochs.
...and 9 more figures
