Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines

Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, Zsolt Kira

TL;DR

Continual learning methods are hard to compare because they are evaluated under widely differing setups. The paper proposes a uniform evaluation framework and a concise taxonomy of task differences based on $P(X)$, $P(Y)$, and whether tasks share an output space. It introduces a systematic way to generate task sequences via permutation and splitting, and shows that simple baselines (e.g., Adagrad, $L_2$ regularization) and naive rehearsal can match or exceed many state-of-the-art methods under equal memory budgets. Incremental task learning is found to be generally easier than incremental class learning, longer task sequences stress regularization-based approaches, and heavy hyperparameter tuning is often needed. The authors release a PyTorch framework for fair benchmarking and suggest directions toward harder evaluation scenarios in which task identity is not provided.

Abstract

Continual learning has received a great deal of attention recently, with several approaches being proposed. However, evaluations involve a diverse set of scenarios, making meaningful comparison difficult. This work provides a systematic categorization of the scenarios and evaluates them within a consistent framework including strong baselines and state-of-the-art methods. The results provide an understanding of the relative difficulty of the scenarios and show that simple baselines (Adagrad, L2 regularization, and naive rehearsal strategies) can surprisingly achieve similar performance to current mainstream methods. We conclude with several suggestions for creating harder evaluation scenarios and future research directions. The code is available at https://github.com/GT-RIPL/Continual-Learning-Benchmark

Paper Structure

This paper contains 8 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The three continual learning scenarios generated by Split MNIST. In each sub-figure, the left dotted rectangle represents the inputs for training, where $(x,y,t)$ means (input image, target class label, input task identity). The right side illustrates the neural network model and the predicted $P(Y)$ of the model. The color of each bar in the categorical distribution maps to a specific output node in the classifier. Note that Split MNIST generates five splits in sequence (0/1, 2/3, 4/5, 6/7, 8/9) for tasks $T_1$ to $T_5$; here we only show the differences between $T_1$ and $T_2$.
  • Figure 2: The three continual learning scenarios generated by Permuted MNIST. In each sub-figure, the left dotted rectangle represents the inputs for training, where $(x,y,t)$ means (input image, target class label, input task identity). The right side illustrates the neural network model and the predicted $P(Y)$ of the model. The color of each bar in the categorical distribution maps to a specific output node in the classifier. Note that the task sequence has 10 different permutations for tasks $T_1$ to $T_{10}$; here we only show the differences between $T_1$ and $T_2$.
  • Figure 3: A comparison between MLP and CNN models. In each subfigure, we list SI and Online EWC with best and worst hyper-parameter selections (solid line) and two additional optimization methods (dashed line).
  • Figure 4: Sensitivity to regularization weight. Top row represents the results of SI, and the bottom row represents the results of Online EWC. Different initialization methods are used in different columns.
  • Figure 5: Sensitivity to initialization method. Top row represents the results of SI, and the bottom row represents the results of Online EWC. Different regularization weights are used in different columns.
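
The captions for Figures 1 and 2 describe task sequences built by splitting the label set (0/1, 2/3, 4/5, 6/7, 8/9) and by applying 10 different pixel permutations. The following is a minimal sketch of how such sequences could be generated, not the code from the authors' repository. All function names are hypothetical, the 784-pixel image size assumes flattened 28x28 MNIST images, and the target mapping in `split_target` assumes one common reading of the scenarios: incremental class learning keeps the original label over a single shared output, while incremental task and domain learning use the label's position within its split.

```python
import random


def make_split_tasks(num_classes=10, classes_per_task=2):
    """Group class labels into consecutive tasks: (0, 1), (2, 3), ..."""
    return [
        tuple(range(start, start + classes_per_task))
        for start in range(0, num_classes, classes_per_task)
    ]


def make_permutations(num_tasks=10, num_pixels=784, seed=0):
    """Draw one fixed random pixel permutation per task."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def permute(image, perm):
    """Apply a pixel permutation to a flattened image (a list of pixel values)."""
    return [image[i] for i in perm]


def split_target(y, scenario, classes_per_task=2):
    """Map an original label to a training target (assumed scenario semantics)."""
    if scenario == "class":
        # Assumption: one shared output over all classes seen so far.
        return y
    if scenario in ("task", "domain"):
        # Assumption: the target is the label's position within its split.
        return y % classes_per_task
    raise ValueError(f"unknown scenario: {scenario}")


splits = make_split_tasks()
perms = make_permutations()
```

In this sketch the difference between the task and domain scenarios is not in the target but in whether the task identity $t$ is given to the model, which is outside the scope of the helper functions.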