Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines
Yen-Chang Hsu, Yen-Cheng Liu, Anita Ramasamy, Zsolt Kira
TL;DR
The paper addresses the difficulty of comparing continual learning methods across diverse evaluation setups. It proposes a uniform evaluation framework and a concise taxonomy of task differences in terms of $P(X)$, $P(Y)$, and whether the output space is shared. It introduces a systematic way to generate task sequences via permutation and splitting, and demonstrates that simple baselines (e.g., Adagrad, $L_2$ regularization) and naive rehearsal can match or exceed many state-of-the-art methods under equal memory budgets. Incremental task learning is found to be generally easier than incremental class learning, longer task sequences stress regularization-based approaches, and heavy hyperparameter tuning is often needed. The authors also release a PyTorch framework for fair benchmarking and suggest directions toward harder, task-identity-agnostic evaluation scenarios. Overall, the study argues for robust benchmarking with strong, practical baselines and for more realistic continual learning evaluations.
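As a rough illustration of what generating task sequences "via permutation and splitting" can look like, the sketch below builds both kinds of sequence using only the Python standard library. The function names (`split_classes`, `make_permutations`, `apply_permutation`) are hypothetical and are not taken from the authors' framework. The choice to leave the first task unpermuted is likewise an assumption made for the example.

```python
import random


def split_classes(num_classes, classes_per_task):
    """Splitting: partition labels 0..num_classes-1 into consecutive,
    disjoint groups, one group per task."""
    labels = list(range(num_classes))
    return [labels[i:i + classes_per_task]
            for i in range(0, num_classes, classes_per_task)]


def make_permutations(num_features, num_tasks, seed=0):
    """Permutation: draw one fixed random ordering of input indices per task.
    Assumption: the first task keeps the identity ordering."""
    rng = random.Random(seed)
    perms = [list(range(num_features))]
    for _ in range(num_tasks - 1):
        perm = list(range(num_features))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def apply_permutation(x, perm):
    """Reorder a flattened input (e.g. image pixels) with a task's permutation."""
    return [x[j] for j in perm]


# Ten classes split into five two-class tasks.
tasks = split_classes(num_classes=10, classes_per_task=2)

# Three tasks over a toy four-feature input, each with its own ordering.
perms = make_permutations(num_features=4, num_tasks=3)
```

In a split sequence each task sees a different subset of classes, whereas in a permuted sequence every task sees all classes but with the input features shuffled by a task-specific ordering.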
Abstract
Continual learning has received a great deal of attention recently, with several approaches being proposed. However, evaluations involve a diverse set of scenarios, making meaningful comparison difficult. This work provides a systematic categorization of the scenarios and evaluates them within a consistent framework that includes strong baselines and state-of-the-art methods. The results provide an understanding of the relative difficulty of the scenarios and show that simple baselines (Adagrad, L2 regularization, and naive rehearsal strategies) can, surprisingly, achieve performance similar to that of current mainstream methods. We conclude with several suggestions for creating harder evaluation scenarios and for future research directions. The code is available at https://github.com/GT-RIPL/Continual-Learning-Benchmark
