Reset It and Forget It: Relearning Last-Layer Weights Improves Continual and Transfer Learning

Lapo Frati; Neil Traft; Jeff Clune; Nick Cheney

TL;DR

This work identifies a simple pre-training mechanism that leads to representations exhibiting better continual and transfer learning, and suggests this approach may be considered a computationally cheaper type of, or alternative to, meta-learning rapidly adaptable features with higher-order gradients.

Abstract

This work identifies a simple pre-training mechanism that leads to representations exhibiting better continual and transfer learning. This mechanism -- the repeated resetting of weights in the last layer, which we nickname "zapping" -- was originally designed for a meta-continual-learning procedure, yet we show it is surprisingly applicable in many settings beyond both meta-learning and continual learning. In our experiments, we transfer a pre-trained image classifier to a new set of classes in a few shots. We show that our zapping procedure results in improved transfer accuracy and/or more rapid adaptation in both standard fine-tuning and continual learning settings, while being simple to implement and computationally efficient. In many cases, by combining zapping with sequential learning, we achieve performance on par with state-of-the-art meta-learning without needing expensive higher-order gradients. An intuitive explanation for the effectiveness of this zapping procedure is that representations trained with repeated zapping learn features that are capable of rapidly adapting to newly initialized classifiers. Such an approach may be considered a computationally cheaper type of, or alternative to, meta-learning rapidly adaptable features with higher-order gradients. This result adds to recent work on the usefulness of resetting neural network parameters during training, and invites further investigation of this mechanism.
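
The zapping operation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `init_row` and `zap`, the Gaussian re-initialization with a small scale, and the choice to reset individual class rows rather than the whole layer are all assumptions made for illustration.

```python
import random


def init_row(n_features, rng, scale=0.01):
    """Draw fresh small random weights for one output unit.

    The Gaussian init and its scale are assumptions, not taken from the paper.
    """
    return [rng.gauss(0.0, scale) for _ in range(n_features)]


def zap(last_layer, class_ids, rng, scale=0.01):
    """Re-initialize the last-layer weight rows of the given classes, in place.

    last_layer: list of rows, one row of weights per output class.
    class_ids:  rows to reset; pass range(len(last_layer)) to reset every row.
    """
    n_features = len(last_layer[0])
    for c in class_ids:
        last_layer[c] = init_row(n_features, rng, scale)
    return last_layer


rng = random.Random(0)
# Hypothetical 3-class head over 4 features, standing in for trained weights.
head = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
zap(head, class_ids=[1], rng=rng)  # only class 1 is forgotten
```

In a real training loop this reset would be repeated throughout pre-training, so that the earlier layers are repeatedly forced to support a freshly initialized classifier.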

Paper Structure (19 sections, 14 figures, 9 tables, 3 algorithms)

Methods
Training Phases
Stage 1: Pre-Training
Stage 2: Transfer
Results
Continual Learning
Transfer Learning
Toward Larger Architectures
Discussion
Related work
Conclusion & Future Work
Full Result Tables
Separate Weights for Inner and Outer Loops
Hyperparameters
Network Architecture
...and 4 more sections

Figures (14)

Figure 1: Alternating Sequential and Batch learning (ASB) alternates between phases of (Step 2) individual examples from a single class, and (Step 4) multi-class batches of examples. Before each sequential phase the existing class is forgotten by ⚡ zapping.
Figure 2: Sequential learning trajectories on Omniglot. Removing the neuromodulation layers from ANML has no impact on performance (Meta-ASB and ANML both achieve 67% final accuracy). Removing zapping, however, drastically affects performance, even when employing meta-learning. We do not compare directly to OML since ANML represents the state of the art.
Figure 3: Average accuracy (with standard deviation error bars) for the sequential transfer learning problem, on Omniglot. Pre-Train is the final validation accuracy of the model on the pre-training dataset. All layers are trained during pre-training. Transfer is the accuracy on held-out instances from the transfer-to dataset at the very end of sequential fine-tuning. Only the last layer is trained (linear probing) during transfer. Models trained with zapping produce significantly ($p < 10^{-8}$) better transfer accuracy than their counterparts without zapping in all cases (p-values of a two-sided Mann-Whitney U test are shown above each pair of bars). Note that the ANML model contains zapping by default and is therefore shaded in the legend.
Figure 4: Accuracy on classes seen so far during continual transfer learning on Mini-ImageNet. Models are trained on 30 examples from 20 new classes not seen during pre-training. All 30 images from a class are shown sequentially one at a time before switching to the next class. After each class, validation accuracy on the transfer set is measured using 100 examples per class, from all classes seen up to that point. Models pre-trained with ASB (with or without meta-gradients) significantly outperform i.i.d. pre-training. ASB+zapping further outperforms plain ASB ($p < 10^{-10}$).
Figure 5: Validation accuracy over training time on all classes in the transfer set during fine-tuning with standard i.i.d. batches. For all datasets, models pre-trained with zapping achieve significantly higher transfer accuracy at the end of fine-tuning. While ASB methods (green, orange) do not dramatically improve final performance, they achieve more rapid fine-tuning relative to i.i.d.+zap pre-training (blue).
...and 9 more figures
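
The alternation described in the Figure 1 caption (zap a class, learn it from individual examples, then train on a multi-class batch) can be sketched as a schedule generator. This only illustrates the ordering of phases. The function name `asb_schedule`, its parameters, the random choice of class, and the phase labels are hypothetical, and the actual weight updates are left out.

```python
import random


def asb_schedule(class_ids, n_seq_examples, batch_classes, n_rounds, rng):
    """Yield (phase, payload) steps for a hypothetical ASB-style loop.

    Each round: zap one class, present its examples one at a time,
    then run one batch phase over several classes.
    """
    for _ in range(n_rounds):
        c = rng.choice(class_ids)
        # Forget the chosen class before its sequential phase.
        yield ("zap", c)
        # Sequential phase: individual examples from that single class.
        for i in range(n_seq_examples):
            yield ("sequential", (c, i))
        # Batch phase: one multi-class batch (class sampling is an assumption).
        yield ("batch", sorted(rng.sample(class_ids, batch_classes)))


rng = random.Random(0)
steps = list(
    asb_schedule(
        class_ids=list(range(10)),
        n_seq_examples=3,
        batch_classes=4,
        n_rounds=2,
        rng=rng,
    )
)
```

Each round yields one zap step, `n_seq_examples` sequential steps, and one batch step, so a training loop could dispatch on the phase label to decide which update to apply.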
