Learning without Forgetting

Zhizhong Li, Derek Hoiem

TL;DR

Addresses continual learning for CNNs when old-task data are unavailable and introduces Learning without Forgetting (LwF), a distillation-based objective that preserves old-task outputs on new-task inputs while learning new-task predictions. The method combines a new-task loss with a response-preserving distillation term and a warm-up step, enabling joint optimization without old data. Across diverse datasets and task pairs, LwF delivers strong new-task performance, maintains old-task accuracy better than fine-tuning or feature extraction, and often matches joint training. This approach offers a practical, scalable solution for extending vision systems with new capabilities without retaining prior datasets.
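The combined objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `cross_entropy`, `lwf_loss`), the weight `lambda_old`, and the `temperature` value are hypothetical choices made for this example, and the distillation term is assumed to be a cross-entropy between temperature-softened old-task outputs. The "recorded" old-task logits stand for the original network's responses to the new-task inputs, computed once before training starts.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))

def lwf_loss(new_logits, new_label, old_logits, recorded_old_logits,
             lambda_old=1.0, temperature=2.0):
    """Return (total, new_term, old_term) for one new-task example.

    lambda_old and temperature are illustrative defaults (assumptions).
    """
    # New-task term: ordinary cross-entropy against the ground-truth label.
    one_hot = [1.0 if i == new_label else 0.0 for i in range(len(new_logits))]
    loss_new = cross_entropy(one_hot, softmax(new_logits))

    # Old-task term: keep the current old-task head close to the responses
    # the original network produced on this same new-task input.
    target = softmax(recorded_old_logits, temperature)
    current = softmax(old_logits, temperature)
    loss_old = cross_entropy(target, current)

    return loss_new + lambda_old * loss_old, loss_new, loss_old

# Example: the old-task head has drifted away from its recorded response.
recorded = [2.0, 0.5, -1.0]
total, new_term, old_term = lwf_loss([0.1, 0.9], 1, [-1.0, 0.5, 2.0], recorded)
```

The old-task term is smallest when the current old-task outputs equal the recorded ones, so minimizing the sum pulls the network toward the new labels while penalizing drift in its old-task responses. The warm-up step mentioned above (training only the new-task parameters before joint optimization) is not shown here.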

Abstract

When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaptation techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: We wish to add new prediction tasks to an existing CNN vision system without requiring access to the training data for existing tasks. This table shows relative advantages of our method compared to commonly used methods.
  • Figure 2: Illustration for our method (e) and methods we compare to (b-d). Images and labels used in training are shown. Data for different tasks are used in alternation in joint training.
  • Figure 3: Procedure for Learning without Forgetting.
  • Figure 4: Performance of each task when gradually adding new tasks to a pre-trained network. Different tasks are shown in different sub-graphs. The $x$-axis labels indicate the new task added to the network each time. Error bars show $\pm 2$ standard deviations for 3 runs with different $\theta_n$ random initializations. Markers are jittered horizontally for visualization, but line plots are not jittered to facilitate comparison. For all tasks, our method degrades more slowly over time than fine-tuning and outperforms feature extraction in most scenarios. For Places2$\rightarrow$VOC, our method performs comparably to joint training.
  • Figure 5: Influence of subsampling new task training set on compared methods. The $x$-axis indicates diminishing training set size. Three runs of our experiments with different random $\theta_n$ initialization and dataset subsampling are shown. Scatter points are jittered horizontally for visualization, but line plots are not jittered to facilitate comparison. Differences between LwF and compared methods on both the old task and the new task decrease with less data, but the observations remain the same. LwF outperforms fine-tuning despite the change in training set size.
  • ...and 2 more figures