Learning Continually by Spectral Regularization

Alex Lewandowski; Michał Bortkiewicz; Saurabh Kumar; András György; Dale Schuurmans; Mateusz Ostaszewski; Marlos C. Machado

TL;DR

We develop a new technique for improving continual learning, inspired by the observation that the singular values of a neural network's parameters at initialization are an important factor for trainability during the early phases of learning.

Abstract

Loss of plasticity is a phenomenon where neural networks can become more difficult to train over the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good performance while maintaining network trainability. We develop a new technique for improving continual learning inspired by the observation that the singular values of the neural network parameters at initialization are an important factor for trainability during early phases of learning. From this perspective, we derive a new spectral regularizer for continual learning that better sustains these beneficial initialization properties throughout training. In particular, the regularizer keeps the maximum singular value of each layer close to one. Spectral regularization directly ensures that gradient diversity is maintained throughout training, which promotes continual trainability, while minimally interfering with performance in a single task. We present an experimental analysis that shows how the proposed spectral regularizer can sustain trainability and performance across a range of model architectures in continual supervised and reinforcement learning settings. Spectral regularization is less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance.
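To make the idea of keeping each layer's maximum singular value close to one concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the squared penalty form, the power-iteration estimator, and the names `spectral_norm`, `spectral_penalty`, and `strength` are all assumptions made for illustration.

```python
import math
import random


def spectral_norm(W, num_iters=100, seed=0):
    """Estimate the largest singular value of W (a list of rows) by power iteration."""
    rng = random.Random(seed)
    n_rows, n_cols = len(W), len(W[0])
    # Random starting direction in the input space.
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(num_iters):
        # u = W v, then v = W^T u: one power-iteration step on W^T W.
        u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v == 0.0:
            return 0.0  # W maps the current direction to zero.
        v = [x / norm_v for x in v]
    # With v a unit vector, ||W v|| approximates the largest singular value.
    u = [sum(W[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in u))


def spectral_penalty(layers, strength=1.0):
    """Assumed penalty: strength * sum over layers of (sigma_max(W) - 1)^2."""
    return strength * sum((spectral_norm(W) - 1.0) ** 2 for W in layers)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]   # sigma_max = 1, no penalty
    stretched = [[3.0, 0.0], [0.0, 1.0]]  # sigma_max = 3, penalized
    print(spectral_penalty([identity]))
    print(spectral_penalty([stretched]))
```

In a training loop, a term like this would be added to the task loss so that layers whose largest singular value drifts away from one are pulled back. A practical version would use an automatic-differentiation framework and reuse the power-iteration vector across steps rather than recomputing it from scratch.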

Paper Structure (40 sections, 4 equations, 16 figures)

Introduction
Problem Setting
Spectral Properties and Continual Trainability
Spectral properties at initialization
An illustrative example
Trainability and Effective Gradient Diversity
Why Do Spectral Properties Deviate From Initialization?
Spectral Regularization for Continual Learning
Experiments
Datasets, Nonstationarities, and Architectures
Loss of Trainability Mitigators
Comparative Evaluation
Looking Inside the Network
Sensitivity Analysis
From Supervised Learning to Reinforcement Learning
...and 25 more sections

Figures (16)

Figure 1: Generalization across different types of non-stationarity on tiny-ImageNet using a ResNet (top) or a Vision Transformer (bottom). Compared to the baselines, spectral regularization is consistently among the best-performing methods across class incremental, label flip, and pixel permutation non-stationarities. Note that the Vision Transformer often achieves better generalization performance than the ResNet architecture.
Figure 2: Continual learning with pixel permutation tasks on SVHN2, CIFAR10, CIFAR100 using a ResNet-18 (top) or a Vision Transformer (bottom). Across different datasets, spectral regularization is effective at maintaining test accuracy on new tasks. Without any mitigators, both ResNet-18 and Vision Transformer have diminishing test accuracy, suggesting loss of plasticity.
Figure 3: Trainability and neural network properties across ImageNet, CIFAR10, and CIFAR100. Baselines that suffer from a loss of trainability (top) also have an increasing average spectral norm (middle-top), and a decrease in their average representation change (middle-bottom).
Figure 4: Sensitivity analysis on regularization strength. Compared to other regularizers, spectral regularization is insensitive to regularization strength while sustaining higher trainability for any given regularization strength.
Figure 5: Spectral regularization enhances plasticity in reinforcement learning in the DMC suite. Spectral regularization is competitive with the network reset + layernorm (Reset) even when the replay buffer is unbounded (Top Left). When the replay buffer size is bounded to 250k steps, spectral regularization improves over the Reset baseline (Bottom Left). In both cases, spectral regularization significantly outperforms layernorm (Baseline). Compared to the hyperparameter governing the reset frequency, spectral regularization is less sensitive to its regularization strength. Spectral regularization also prevents both parameters and gradients from exploding, and reduces value overestimation (Right).
...and 11 more figures
