Directions of Curvature as an Explanation for Loss of Plasticity

Alex Lewandowski; Haruto Tanaka; Dale Schuurmans; Marlos C. Machado

TL;DR

This paper offers a consistent explanation for loss of plasticity: neural networks lose directions of curvature during training, and loss of plasticity can be attributed to this reduction in curvature. It also shows that regularizers which mitigate loss of plasticity preserve curvature.

Abstract

Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for loss of plasticity: neural networks lose directions of curvature during training, and loss of plasticity can be attributed to this reduction in curvature. To support this claim, we provide a systematic investigation of loss of plasticity across continual learning tasks using MNIST, CIFAR-10 and ImageNet. Our findings illustrate that loss of curvature directions coincides with loss of plasticity, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings we considered.

Paper Structure (34 sections, 1 equation, 23 figures)

Introduction
Factors and Explanations for Loss of Plasticity
Factors That Can Contribute to Loss of Plasticity
Previous Explanations for Loss of Plasticity
Counterexamples for Previous Explanations
Methods
Results
Summary
Measuring the Curvature of a Changing Optimization Landscape
Approximating the Hessian Rank
Validating the Hessian Rank Approximation
Preserving Curvature with Regularization
Experiments: Effect of Curvature and Regularization in Plasticity Benchmarks
Does Loss of Curvature Explain Loss of Plasticity?
How Does Loss of Curvature Affect Learning?
...and 19 more sections

Figures (23)

Figure 1: Inconsistencies of previous explanations for loss of plasticity on Random Label MNIST (subset). The explanations on the left are not consistent because both ReLU and leaky-ReLU suffer from loss of plasticity. On the right, there is no loss of plasticity for tanh and identity, but the corresponding explanations predict that there should be. All results have a shaded region corresponding to a 95% confidence interval of the mean over 30 runs.
Figure 2: Effect of feature rank regularization in maintaining plasticity. Loss of plasticity still occurs with leaky-ReLU and feature rank regularization, despite the fact that the feature rank remains high. All results have a shaded region corresponding to a 95% confidence interval of the mean over 30 runs.
Figure 3: Comparison between different methods for approximating the Hessian rank. The empirical Fisher approximation to the Hessian rank is highly accurate in the first few tasks, which is when loss of plasticity occurs. When plasticity worsens in later tasks, the approximation quality marginally worsens. Overall, the empirical Fisher is an accurate and efficient approximation to the Hessian rank.
Figure 4: Validating that a reduction in the directions of curvature is a consistent explanation for loss of plasticity. A reduction in the directions of curvature co-occurs with loss of plasticity. leaky-ReLU preserves plasticity for longer but is unable to recover its directions of curvature.
Figure 5: Curvature explains why the average update norm increases when using leaky-ReLU despite loss of plasticity. Left: leaky-ReLU has an increasing average update norm despite a decrease in the gradient norm at the beginning of a task. Right: gradients with leaky-ReLU have less overlap with the low-rank Hessian, meaning that updates occur in more directions than with ReLU.
...and 18 more figures
