Plastic Learning with Deep Fourier Features

Alex Lewandowski, Dale Schuurmans, Marlos C. Machado

TL;DR

Replacing ReLU activations with deep Fourier features, the concatenation of a sine and a cosine in every layer, drastically improves continual learning performance. This combination provides a dynamic balance between the trainability obtained through linearity and the effectiveness obtained through the nonlinearity of neural networks.

Abstract

Deep neural networks can struggle to learn continually in the face of non-stationarity. This phenomenon is known as loss of plasticity. In this paper, we identify underlying principles that lead to plastic algorithms. In particular, we provide theoretical results showing that linear function approximation, as well as a special case of deep linear networks, do not suffer from loss of plasticity. We then propose deep Fourier features, which are the concatenation of a sine and cosine in every layer, and we show that this combination provides a dynamic balance between the trainability obtained through linearity and the effectiveness obtained through the nonlinearity of neural networks. Deep networks composed entirely of deep Fourier features are highly trainable and sustain their trainability over the course of learning. Our empirical results show that continual learning performance can be drastically improved by replacing ReLU activations with deep Fourier features. These results hold for different continual learning scenarios (e.g., label noise, class incremental learning, pixel permutations) on all major supervised learning datasets used for continual learning research, such as CIFAR10, CIFAR100, and tiny-ImageNet.
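The abstract describes deep Fourier features as the concatenation of a sine and a cosine in every layer. A minimal sketch of such a layer in plain Python follows. The function names, the row-major weight layout, the Gaussian initialization, and the layer sizes are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def fourier_layer(x, weights, bias):
    """Apply one deep Fourier feature layer to a single input vector.

    x:       list of d_in floats
    weights: list of d_out rows, each a list of d_in floats (assumed layout)
    bias:    list of d_out floats
    Returns 2 * d_out floats: the sines of all pre-activations,
    followed by the cosines of the same pre-activations.
    """
    # Linear combination of the inputs gives the pre-activations z.
    z = [sum(w * x_j for w, x_j in zip(row, x)) + b
         for row, b in zip(weights, bias)]
    # Each pre-activation feeds both a sin unit and a cos unit.
    return [math.sin(v) for v in z] + [math.cos(v) for v in z]


def init_layer(d_in, d_out, rng):
    """Illustrative Gaussian initialization (an assumption, not the paper's)."""
    scale = 1.0 / math.sqrt(d_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(d_in)]
               for _ in range(d_out)]
    bias = [0.0] * d_out
    return weights, bias


rng = random.Random(0)
x = [0.5, -1.0, 2.0]
w1, b1 = init_layer(3, 4, rng)  # 4 pre-activations produce 8 features
w2, b2 = init_layer(8, 2, rng)  # the next layer consumes all 8 features
h1 = fourier_layer(x, w1, b1)
h2 = fourier_layer(h1, w2, b2)
```

Because the concatenation doubles the output width, a layer that should emit a fixed number of features needs only half as many pre-activations, and therefore about half as many weights.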

Paper Structure

This paper contains 34 sections, 6 theorems, 16 equations, 11 figures.

Key Result

Theorem 1

Let $\theta^{(\tau T)}$ denote the linear weights learned at the end of the $\tau$-th task, with the corresponding unique global minimum for task $\tau$ being denoted by $\theta_{\tau}^{\star}$. Assuming the objective function is $\mu$-strongly convex, the suboptimality gap for gradient descent on t where each task lasts for $T$ iterations, $D$ is the assumed bound on the parameters at the global m

Figures (11)

  • Figure 1: A neural network with deep Fourier features in every layer approximately embeds a deep linear network. A single layer using deep Fourier features linearly combines the inputs, $x$, to compute the pre-activations, $z$, and each pre-activation is mapped to both a cos unit and a sin unit (Left). For each pre-activation, either the sin unit (Middle) or the cos unit (Right) is well-approximated by a linear function.
  • Figure 2: Trainability on a linearly separable task. Higher opacity corresponds to deeper networks, with depths in {1, 2, 4, 8, 16}. Deep linear networks sustain trainability on new tasks, with some additional depth improving trainability. Nonlinear networks using ReLU suffer from loss of trainability at any depth, even on this simple sequence of linearly separable problems.
  • Figure 3: Trainability on a linearly separable task with $\alpha$-linearization. Darker lines correspond to higher values of $\alpha$. Unit sign entropy increases as $\alpha$ increases (inset), leading to sustained trainability for $\alpha$-ReLU.
  • Figure 4: Trainability on a task that is not linearly separable. Deep Fourier features improve and sustain their trainability when other networks cannot.
  • Figure 5: Training a ResNet-18 continually with diminishing label noise. Deep Fourier features perform particularly well on complex tasks like tiny-ImageNet. Although networks with deep Fourier features have approximately half the number of parameters, they surpass the baselines on CIFAR100 and are on par with spectral regularization on CIFAR10.
  • ...and 6 more figures
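
The caption of Figure 1 states that, for each pre-activation, either the sin unit or the cos unit is well-approximated by a linear function. One way to probe this numerically is to compare each unit with its first-order Taylor expansion around a pre-activation `z0`. The sketch below does that; the window `delta = 0.1`, the grid of `z0` values, and the helper names are illustrative assumptions, and the Taylor-based check is an illustration rather than the paper's own argument.

```python
import math


def tangent_error(f, df, z0, delta):
    """Absolute error of the tangent line of f at z0, evaluated at z0 + delta."""
    linear = f(z0) + df(z0) * delta
    return abs(f(z0 + delta) - linear)


def unit_errors(z0, delta):
    """Return (sin_error, cos_error) of the linear approximations at z0."""
    sin_err = tangent_error(math.sin, math.cos, z0, delta)
    cos_err = tangent_error(math.cos, lambda z: -math.sin(z), z0, delta)
    return sin_err, cos_err


delta = 0.1  # assumed size of the perturbation around z0
grid = [k * 0.01 for k in range(-700, 701)]  # assumed range of pre-activations

# For each z0, keep the error of whichever unit is closer to linear.
best_errors = [min(unit_errors(z0, delta)) for z0 in grid]
worst_errors = [max(unit_errors(z0, delta)) for z0 in grid]

print(f"largest error of the better unit: {max(best_errors):.5f}")
print(f"largest error of the worse unit:  {max(worst_errors):.5f}")
```

The second derivatives of sin and cos are -sin and -cos, and since sin² + cos² = 1 they cannot both be large at the same point. Near z0 = 0 the sin unit is the nearly linear one, and near z0 = π/2 the cos unit takes over.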

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Definition 1: Unit Sign Entropy
  • Definition 2: $\alpha$-linearization
  • Proposition 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Proof of Theorem \ref{thm:linear}
  • Proof of Lemma \ref{lem:equality}
  • ...and 4 more