Same accuracy, twice as fast: continuous training surpasses retraining from scratch

Eli Verwimp, Guy Hacohen, Tinne Tuytelaars

TL;DR

The paper tackles the high computational cost of updating a model when both old and new data are available. It introduces an evaluation framework that measures training efficiency as the number of iterations needed to reach a target accuracy, and reports speedups relative to retraining from scratch of up to about 2.7x. The authors identify four optimization axes (initialization, regularization, batch composition, and learning rate scheduling) and provide a first-step method for each, showing that the techniques are complementary and effective across computer vision tasks. Combined, the methods substantially reduce training compute while matching or improving final accuracy in the setting where all old data remains accessible.

Abstract

Continual learning aims to enable models to adapt to new datasets without losing performance on previously learned data, often assuming that prior data is no longer available. However, in many practical scenarios, both old and new data are accessible. In such cases, good performance on both datasets is typically achieved by abandoning the model trained on the previous data and re-training a new model from scratch on both datasets. This training from scratch is computationally expensive. In contrast, methods that leverage the previously trained model and old data are worthy of investigation, as they could significantly reduce computational costs. Our evaluation framework quantifies the computational savings of such methods while maintaining or exceeding the performance of training from scratch. We identify key optimization aspects -- initialization, regularization, data selection, and hyper-parameters -- that can each contribute to reducing computational costs. For each aspect, we propose effective first-step methods that already yield substantial computational savings. By combining these methods, we achieve up to 2.7x reductions in computation time across various computer vision tasks, highlighting the potential for further advancements in this area.
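The efficiency measure described above (iterations needed to reach a target accuracy, reported relative to training from scratch) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the curve format, and all numbers in the usage example are hypothetical, and how the target accuracy is chosen is left as a parameter because this excerpt does not specify it.

```python
from typing import Optional, Sequence, Tuple

# A curve is a list of (iteration, test accuracy) pairs sorted by iteration.
# This format is an assumption made for the sketch.
Curve = Sequence[Tuple[int, float]]


def iterations_to_target(curve: Curve, target: float) -> Optional[int]:
    """Return the first logged iteration whose accuracy reaches `target`, or None."""
    for iteration, accuracy in curve:
        if accuracy >= target:
            return iteration
    return None


def relative_speedup(scratch: Curve, method: Curve, target: float) -> Optional[float]:
    """Ratio of iterations needed by scratch training to those needed by `method`.

    Returns None if either run never reaches the target accuracy.
    """
    scratch_iters = iterations_to_target(scratch, target)
    method_iters = iterations_to_target(method, target)
    if scratch_iters is None or method_iters is None or method_iters == 0:
        return None
    return scratch_iters / method_iters


# Illustrative numbers only; they are not results from the paper.
scratch_curve = [(1000, 0.40), (2000, 0.60), (3000, 0.70), (4000, 0.75)]
method_curve = [(1000, 0.55), (2000, 0.76), (3000, 0.77)]
speedup = relative_speedup(scratch_curve, method_curve, target=0.75)
print(speedup)
```

A run that never reaches the target yields `None` rather than a speedup, which mirrors the requirement that savings only count when the scratch-level accuracy is matched or exceeded.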

Paper Structure

This paper contains 31 sections, 6 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Test accuracy on CIFAR100 (70+30) with a model pre-trained on 70 classes. The 'scratch' method starts from random initialization, while the 'naive' approach uses the pre-trained model without modification. 'Ours' modifies the optimization process (see the method section) and matches the scratch performance with $2.7\times$ lower computational cost.
  • Figure 2: Initialization. Naive continuous training is slower and less accurate than retraining from scratch. Re-introducing plasticity with shrink-and-perturb improves both speed and accuracy, surpassing scratch training.
  • Figure 3: Objective function. Regularizing the objective function with $L2$ losses is beneficial in both from-scratch and continuous training, yet the latter outperforms the former when using $L2$-init regularization.
  • Figure 4: Batch composition. 'Old / new' sampling balances old and new examples in each batch, unlike the naive baseline, which uses proportional sampling. 'Easy / hard' sampling reduces the inclusion of the easiest and hardest old examples, significantly improving performance. (Naive and 'old/new' results nearly overlap.)
  • Figure 5: Hyperparameters. Shortening the learning rate scheduler allows for faster convergence but at the cost of lower final accuracy. On its own, changing the scheduler does not reach the required accuracy, yet when combined with the other aspects, it becomes important (see the ablation section).
  • ...and 10 more figures
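
The captions for Figures 2 to 4 name three techniques: shrink-and-perturb, $L2$-init regularization, and balanced 'old / new' batch sampling. The sketch below shows one common reading of each on plain Python lists. It is an illustration under stated assumptions, not the authors' implementation: the function names, default values, the Gaussian noise, the choice of `anchor`, and the 50/50 batch split are all hypothetical, since this excerpt does not give those details.

```python
import random
from typing import List, Optional, Sequence


def shrink_and_perturb(weights: Sequence[float], shrink: float = 0.5,
                       noise_std: float = 0.01,
                       seed: Optional[int] = None) -> List[float]:
    """Scale each weight toward zero, then add small Gaussian noise.

    `shrink` and `noise_std` are placeholder values, not the paper's settings.
    """
    rng = random.Random(seed)
    return [shrink * w + rng.gauss(0.0, noise_std) for w in weights]


def l2_init_penalty(weights: Sequence[float], anchor: Sequence[float],
                    strength: float = 0.01) -> float:
    """Squared L2 distance between `weights` and a fixed `anchor`, scaled by `strength`.

    Which parameters serve as the anchor is an assumption left to the caller.
    """
    return strength * sum((w - a) ** 2 for w, a in zip(weights, anchor))


def balanced_batch(old_ids: Sequence[int], new_ids: Sequence[int],
                   batch_size: int, rng: random.Random) -> List[int]:
    """Draw half of the batch from old examples and the rest from new ones.

    Sampling is without replacement within a batch (an assumption).
    """
    n_old = batch_size // 2
    return rng.sample(old_ids, n_old) + rng.sample(new_ids, batch_size - n_old)


# Usage on toy values.
pretrained = [1.0, -2.0, 0.5]
restarted = shrink_and_perturb(pretrained, shrink=0.5, noise_std=0.01, seed=0)
penalty = l2_init_penalty(restarted, anchor=pretrained, strength=0.01)
batch = balanced_batch(list(range(70)), list(range(70, 100)), batch_size=8,
                       rng=random.Random(0))
```

In a real training loop the penalty would be added to the task loss at every step, and the balanced sampler would replace proportional sampling over the union of old and new data.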