Continual Backprop: Stochastic Gradient Descent with Persistent Randomness

Shibhansh Dohare, Richard S. Sutton, A. Rupam Mahmood

TL;DR

The paper addresses the challenge of continual learning in non-stationary environments, showing that standard Backprop and its variants exhibit decaying plasticity over time. It introduces Continual Backprop (CBP), which continually injects random features through a generate-and-test loop to preserve initial randomness benefits and sustain adaptation. CBP demonstrates continual learning capabilities across semi-stationary supervised tasks and a non-stationary reinforcement learning problem, outperforming Backprop baselines and offering a practical extension with similar computational cost. This work provides a path toward robust continual learning in real-world, non-stationary settings by integrating ongoing random feature discovery with gradient-based optimization.

Abstract

The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this degradation in Backprop's plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the Continual Backprop algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.

Paper Structure

This paper contains 17 sections, 3 equations, 19 figures, 3 algorithms.

Figures (19)

  • Figure 1: The input and target function generating the output in the Bit-Flipping problem. The input has $m+1$ bits. One of the first $f$ bits (the flipping bits) is chosen after every $T$ time-steps, and its value is flipped. The next $m-f$ bits are i.i.d. at every time-step. The target function is represented by a neural network with a single hidden layer of LTUs.
  • Figure 2: The learning curve on the Bit-Flipping problem using Backprop. Surprisingly, after performing well initially, the error goes up for all step-sizes and activation functions. Backprop's ability to track becomes worse under extended tracking on the Bit-Flipping problem. For ReLU, its performance becomes even worse than that of the linear learner.
  • Figure 3: The online classification accuracy of a deep ReLU network on Permuted MNIST. The online accuracy is binned into bins of size 60,000. The performance of Backprop gets worse over time for all step-sizes, meaning that Backprop loses its ability to adapt under extended tracking.
  • Figure 4: A feature/hidden-unit in a network. The utility of a feature at time $t$ is the product of its contribution utility and its adaptation utility. Adaptation utility is the inverse of the sum of the magnitudes of the incoming weights. Contribution utility is the product of the sum of the magnitudes of the outgoing weights and the magnitude of the feature activation ($h_{l,i}$) minus its average ($\hat{f}_{l,i}$), where $\hat{f}_{l,i}$ is a running average of $h_{l,i}$.
  • Figure 5: The learning curves and parameter sensitivity plots of Backprop (BP), Backprop with L2, Backprop with Online Normalization, and Continual Backprop (CBP) on the Bit-Flipping problem. Only CBP has a non-increasing error rate in all cases. Continually injecting randomness alongside gradient descent (CBP) is better for continual adaptation than gradient descent alone (BP).
  • ...and 14 more figures
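
The feature utility described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the exponential form of the running average, and the `decay` value are all hypothetical, and the caption's "sum of the magnitude" is read as a sum of absolute values.

```python
def update_running_average(avg, value, decay=0.99):
    """Running average of a feature's activation.

    Assumption: an exponential moving average; the caption only says
    "running average", so the form and the decay value are illustrative.
    """
    return decay * avg + (1.0 - decay) * value


def feature_utility(activation, avg_activation, incoming_weights, outgoing_weights):
    """Instantaneous utility of one hidden unit, following the Figure 4 caption.

    Assumes at least one incoming weight is nonzero (otherwise the
    adaptation utility is undefined).
    """
    # Contribution utility: |h - f_hat| times the summed magnitude of outgoing weights.
    contribution = abs(activation - avg_activation) * sum(abs(w) for w in outgoing_weights)
    # Adaptation utility: inverse of the summed magnitude of incoming weights.
    adaptation = 1.0 / sum(abs(w) for w in incoming_weights)
    # Overall utility is the product of the two.
    return contribution * adaptation


# Example: one unit with two incoming and two outgoing weights.
f_hat = update_running_average(avg=0.0, value=0.5, decay=0.5)
utility = feature_utility(
    activation=1.0,
    avg_activation=f_hat,
    incoming_weights=[0.5, -0.5],
    outgoing_weights=[1.0, -1.0],
)
```

Under this reading, a unit scores low when its activation barely deviates from its average, when its outgoing weights are small, or when its incoming weights are large, which is presumably what makes it a candidate for replacement in the generate-and-test loop.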