Hyperparameters in Continual Learning: A Reality Check
Sungmin Cha, Kyunghyun Cho
TL;DR
This work questions the validity of the conventional continual learning (CL) evaluation protocol, which tunes hyperparameters within a fixed CL scenario and reports results in that same scenario, arguing that it overestimates CL capacity. It introduces the Generalizable Two-phase Evaluation Protocol (GTEP), which separates hyperparameter tuning on dataset $D^{HT}$ from evaluation on dataset $D^{E}$ while keeping the same scenario configuration, in order to measure generalization to unseen CL scenarios. Across more than 8,000 experiments in class-incremental learning (class-IL), with and without pretrained models, the study finds that many state-of-the-art methods do not generalize well under GTEP; DER shows comparatively stronger cross-scenario generalization, while newer methods exhibit sensitivity and higher costs. The findings advocate a shift toward more rigorous, generalizable evaluation standards in CL to ensure robust and scalable methods. Future directions include extending GTEP to online CL and other domains and pursuing more sample-efficient hyperparameter tuning. $D^{HT}$, $D^{E}$, $P^{HT}$, and $P^{E}$ are the protocol's central quantities.
Abstract
Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominant conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. Starting from the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP), consisting of a hyperparameter tuning phase and an evaluation phase. The scenarios in both phases share the same configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation is available at https://github.com/csm9493/GTEP.
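The difference between the two protocols can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`grid`, `conventional_protocol`, `gtep`, `toy_run`), the use of a plain string to stand in for a CL scenario, the exhaustive grid search, and the toy scoring function are all assumptions made for illustration. The only property taken from the text is that hyperparameters are selected in one scenario and then applied unchanged in a second scenario with the same configuration but a different dataset.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

HParams = Dict[str, float]
# Hypothetical interface: run a CL algorithm on a scenario with the given
# hyperparameters and return a scalar score (e.g., final average accuracy).
RunFn = Callable[[str, HParams], float]


def grid(search_space: Dict[str, Sequence[float]]) -> List[HParams]:
    """Enumerate every hyperparameter combination in the search space."""
    keys = list(search_space)
    return [dict(zip(keys, values))
            for values in product(*(search_space[k] for k in keys))]


def conventional_protocol(run: RunFn, scenario: str,
                          search_space: Dict[str, Sequence[float]]
                          ) -> Tuple[HParams, float]:
    """Tune and report in the same scenario (the protocol being criticized)."""
    scored = [(run(scenario, hp), hp) for hp in grid(search_space)]
    best_score, best_hp = max(scored, key=lambda pair: pair[0])
    return best_hp, best_score


def gtep(run: RunFn, scenario_ht: str, scenario_e: str,
         search_space: Dict[str, Sequence[float]]) -> Tuple[HParams, float]:
    """Phase 1: tune on the D^HT scenario. Phase 2: evaluate on the D^E scenario.

    Both scenarios are assumed to share one configuration (e.g., number of
    tasks) and to differ only in the underlying dataset.
    """
    best_hp, _ = conventional_protocol(run, scenario_ht, search_space)
    return best_hp, run(scenario_e, best_hp)  # no further tuning on D^E


def toy_run(scenario: str, hp: HParams) -> float:
    """Made-up response surface: each dataset prefers a different learning rate."""
    optimum = {"D_HT": 0.1, "D_E": 0.01}[scenario]
    return 1.0 - abs(hp["lr"] - optimum)


space = {"lr": [0.001, 0.01, 0.1]}
hp_conv, score_conv = conventional_protocol(toy_run, "D_E", space)
hp_gtep, score_gtep = gtep(toy_run, "D_HT", "D_E", space)
print("conventional:", hp_conv, score_conv)
print("GTEP:        ", hp_gtep, score_gtep)
```

In this toy setting the conventional protocol reports the best score attainable on the evaluation scenario, whereas GTEP reports the score obtained with hyperparameters chosen elsewhere, so the GTEP score is lower. The gap between the two numbers is the kind of overestimation the abstract describes; its size here is an artifact of the invented scoring function and says nothing about real algorithms.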
