Eidetic Learning: an Efficient and Provable Solution to Catastrophic Forgetting

Nicholas Dronen, Randall Balestriero

TL;DR

Catastrophic forgetting remains a major hurdle for sequential task learning in neural networks. Eidetic Learning builds EideticNets that guarantee immunity to forgetting by iteratively pruning and freezing important neurons per task and recycling unimportant ones, enabling nested feature reuse without rehearsal. The approach supports common architectures, provides data-conditional routing through per-task heads, and uses a task classifier at inference time to select the appropriate head, achieving competitive results on Permuted MNIST, sequential CIFAR-100, and Imagenette with linear time/space characteristics. Overall, EideticNets offer a principled, scalable solution with strong practical impact for continual learning while outlining clear avenues for extending to backward transfer and more challenging class-incremental settings.

Abstract

Catastrophic forgetting -- the phenomenon of a neural network learning a task $t_1$ and losing the ability to perform it after being trained on some other task $t_2$ -- is a long-standing problem for neural networks [McCloskey and Cohen, 1989]. We present a method, Eidetic Learning, that provably solves catastrophic forgetting. A network trained with Eidetic Learning -- here, an EideticNet -- requires no rehearsal or replay. We consider successive discrete tasks and show how at inference time an EideticNet automatically routes new instances without auxiliary task information. An EideticNet bears a family resemblance to the sparsely-gated Mixture-of-Experts layer [Shazeer et al., 2016] in that network capacity is partitioned across tasks and the network itself performs data-conditional routing. An EideticNet is easy to implement and train, is efficient, and has time and space complexity linear in the number of parameters. The guarantee of our method holds for normalization layers of modern neural networks during both pre-training and fine-tuning. We show with a variety of network architectures and sets of tasks that EideticNets are immune to forgetting. While the practical benefits of EideticNets are substantial, we believe they can benefit practitioners and theorists alike. The code for training EideticNets is available at https://github.com/amazon-science/eideticnet-training.

Paper Structure

This paper contains 19 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Accuracy of an MLP trained on 10 tasks of Permuted MNIST in a single run of our method. Lines (bands) are a moving average (standard deviation) over a window of 10 steps. See Table \ref{tab:pmnist-results} for a comparison with other methods.
  • Figure 2: Eidetic Learning eliminates forgetting by preserving important neurons, deleting unimportant synapses, and recycling unimportant neurons for subsequent tasks. Preserving important neurons can be done in several ways. Figure \ref{fig:intro-disjoint-graph} depicts a network in which the neurons of task $t_2$ are completely separated from the neurons of task $t_1$. This is an inefficient use of a network's capacity. Since task $t_2$ is trained after task $t_1$, allowing the neurons important to $t_2$ to benefit from the features learned by $t_1$ in the previous layer is more efficient. This latter way is shown as the dashed orange lines in Figure \ref{fig:intro-nested-graph}. A detailed depiction along with the parameters' configurations is also provided in Figures \ref{fig:method-disjoint} and \ref{fig:method-nested}.
  • Figure 3: In Eidetic Training, the fraction of neurons pruned while training a task increases until training set accuracy drops, and the unpruned neurons are frozen cumulatively. The figure shows the progress of training with a ResNet50 trained on Sequential CIFAR100 with five tasks.
  • Figure 4: Test set accuracy of ResNet50 trained on Sequential CIFAR100 with five tasks.
  • Figure 5: Consider a feed-forward ANN with layers $\ell$, $\ell+1$ trained on some task $t_i$. Omitting for convenience the non-linearity $\sigma$ and the bias ${\bm{b}}$, processing the input vector ${\bm{x}}^{\ell-1}$ of all $1$s entails the matrix-vector products ${\bm{W}}^{\ell} {\bm{x}}^{\ell-1}$ and ${\bm{W}}^{\ell+1} {\bm{x}}^{\ell}$. We show them here as the composition ${\bm{W}}^{\ell+1}({\bm{W}}^{\ell} {\bm{x}}^{\ell-1})$. Imagine that the smallest set of neurons required to perform $t_i$ is determined to be $\mathcal{N}_{t_i} := \{ {\bm{W}}_2^{\ell}, {\bm{W}}_4^{\ell}, {\bm{W}}_2^{\ell+1}, {\bm{W}}_3^{\ell+1}, {\bm{W}}_4^{\ell+1} \}$ (white). For task $t_i$, the excess capacity consists of all other neurons, and the neurons to recycle when training $t_{i+1}$ are $\mathcal{R} := \{ {\bm{W}}_1^{\ell}, {\bm{W}}_3^{\ell}, {\bm{W}}_1^{\ell+1} \}$ (blue in ${\bm{W}}^{\ell}$, red in ${\bm{W}}^{\ell+1}$). While training $t_i$, we prune the neurons in $\mathcal{R}$ and permanently delete their synaptic connections to the important neurons (blue $\emptyset$s in ${\bm{W}}^{\ell}$). When training of $t_i$ is complete, we reinitialize the neurons in $\mathcal{R}$ from some random variable $X$. Figure \ref{fig:method-disjoint} illustrates the naive approach that leads to the complete partitioning of task $t_i$ from $t_{j>i}$ (cf. Figure \ref{fig:intro-disjoint-graph}). The efficient nested feature sharing that EideticNets enable is shown in Figure \ref{fig:method-nested} (cf. Figure \ref{fig:intro-nested-graph}).
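
The mechanism described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a toy example using only the Python standard library. All function and variable names are hypothetical, indices are 0-based (so the caption's neurons 2 and 4 become rows 1 and 3), a uniform distribution stands in for the unspecified random variable $X$, and the synapse deletion is modeled as zeroing the weights through which important neurons in layer $\ell+1$ read from recycled neurons in layer $\ell$. As in the caption, the non-linearity and bias are omitted.

```python
import random


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def forward(W_l, W_l1, x):
    """Two linear layers; returns the hidden activations and the output."""
    h = matvec(W_l, x)
    return h, matvec(W_l1, h)


def delete_synapses(W_l1, important_l1, recycle_l):
    """Zero the weights by which important neurons in layer l+1
    read from neurons of layer l that will be recycled."""
    for r in important_l1:
        for c in recycle_l:
            W_l1[r][c] = 0.0


def recycle(W_l, W_l1, recycle_l, recycle_l1, important_l, rng, nested=True):
    """Reinitialize recycled neurons (rows). With nested=True a recycled
    neuron in layer l+1 may read frozen features of layer l; with
    nested=False those incoming weights are zeroed (disjoint partition)."""
    for r in recycle_l:
        W_l[r] = [rng.uniform(-1, 1) for _ in W_l[r]]
    for r in recycle_l1:
        W_l1[r] = [
            rng.uniform(-1, 1) if (nested or c not in important_l) else 0.0
            for c in range(len(W_l1[r]))
        ]


rng = random.Random(0)
W_l = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W_l1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]

# Neuron sets from the caption, shifted to 0-based row indices.
important_l, recycle_l = [1, 3], [0, 2]
important_l1, recycle_l1 = [1, 2, 3], [0]

x = [1.0, 1.0, 1.0]  # input vector of all 1s

delete_synapses(W_l1, important_l1, recycle_l)
h_before, y_before = forward(W_l, W_l1, x)

# Stand-in for training the next task: overwrite only the recycled neurons.
recycle(W_l, W_l1, recycle_l, recycle_l1, important_l, rng, nested=True)
h_after, y_after = forward(W_l, W_l1, x)

# Outputs of the frozen, important neurons are unchanged.
assert all(h_before[i] == h_after[i] for i in important_l)
assert all(y_before[i] == y_after[i] for i in important_l1)
```

Because the important neurons no longer receive any input from the recycled ones, overwriting the recycled rows cannot change the outputs that the earlier task depends on. Passing `nested=False` to the hypothetical `recycle` helper yields the fully partitioned variant, in which the recycled neuron in layer $\ell+1$ cannot reuse the frozen features of layer $\ell$.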