Towards Robust Evaluations of Continual Learning

Sebastian Farquhar, Yarin Gal

TL;DR

The paper critiques current continual learning evaluations as biased and unrepresentative of real-world challenges. It introduces five core desiderata for robust benchmarks and shows empirically that many leading approaches fail when evaluated under all of them, while standard evaluations often favor prior-focused methods. By comparing prior-focused, likelihood-focused, and hybrid strategies across more rigorous experiment designs (including single-headed Split MNIST and combinations with coresets and GAN-based replay), the authors demonstrate the need for richer evaluation protocols. The work argues for a community-wide shift toward robust, multi-faceted benchmarking so that progress translates to real-world continual learning settings.

Abstract

Experiments used in current continual learning research do not faithfully assess fundamental challenges of learning continually. Instead of assessing performance on challenging and representative experiment designs, recent research has focused on increased dataset difficulty, while still using flawed experiment set-ups. We examine standard evaluations and show why these evaluations make some continual learning approaches look better than they are. We introduce desiderata for continual learning evaluations and explain why their absence creates misleading comparisons. Based on our desiderata we then propose new experiment designs which we demonstrate with various continual learning approaches and datasets. Our analysis calls for a reprioritization of research effort by the community.

Paper Structure

This paper contains 47 sections, 2 equations, 19 figures.

Figures (19)

  • Figure 1: We contrast prior-focused with likelihood-focused continual learning. By comparing these, and hybrid forms, we can see which evaluations pose a bigger challenge to different approaches.
  • Figure 2: Single-headed Split MNIST. This experiment meets all core desiderata and shows big performance differences.
  • Figure 3: Single-headed Split Fashion MNIST. The harder dataset shows the prior-approximation starting to deteriorate, but does not change performance ranking.
  • Figure 4: Multi-headed Split MNIST. All methods succeed.
  • Figure 5: Multi-headed Split Fashion MNIST. All methods perform similarly, so there is no clear differentiation. VCL performs slightly worse without coresets.
  • ...and 14 more figures
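
The captions above contrast single-headed and multi-headed evaluation. As a rough sketch of that distinction, assuming the common usage in which a multi-headed model is told the task at test time and a single-headed model is not, the snippet below shows how the same output scores can give different predictions. The function names, the task split, and the scores are all hypothetical and chosen for illustration; none of them come from the paper.

```python
# Hypothetical illustration: names and numbers are not taken from the paper.

def argmax_over(scores, allowed):
    """Return the class in `allowed` with the highest score."""
    return max(allowed, key=lambda c: scores[c])

def predict_multi_headed(scores, task_classes):
    # Task identity is supplied at test time, so only that task's
    # classes compete for the prediction.
    return argmax_over(scores, task_classes)

def predict_single_headed(scores):
    # No task identity: every class competes.
    return argmax_over(scores, range(len(scores)))

# Split-MNIST-style tasks: digit pairs presented one task at a time.
TASKS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

# Made-up scores for an image of the digit 1, from a model that has
# drifted toward the most recently trained task (8, 9).
scores = [0.05, 0.30, 0.02, 0.03, 0.01, 0.02, 0.02, 0.05, 0.10, 0.40]

multi = predict_multi_headed(scores, TASKS[0])
single = predict_single_headed(scores)
print("multi-headed prediction:", multi)
print("single-headed prediction:", single)
```

With these made-up scores, the multi-headed prediction only has to separate 0 from 1, while the single-headed prediction must also resist the classes of later tasks. This is one way to read why all methods succeed in the multi-headed figures while the single-headed figures show large performance differences.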