Hyperparameters in Continual Learning: A Reality Check

Sungmin Cha, Kyunghyun Cho

TL;DR

This work questions the validity of conventional continual learning (CL) evaluation, which tunes hyperparameters within a fixed CL scenario and then reports results in that same scenario, arguing that it overestimates CL capacity. It introduces the Generalizable Two-phase Evaluation Protocol (GTEP), which separates hyperparameter tuning on a dataset $D^{HT}$ from evaluation on a different dataset $D^{E}$ while keeping the same scenario configuration, in order to measure generalization to unseen CL scenarios. Across more than 8,000 experiments in class-incremental learning (class-IL) with and without pretrained models, the study finds that many state-of-the-art methods do not generalize well under GTEP, with DER showing comparatively stronger cross-scenario generalization and newer methods exhibiting sensitivity and higher costs. The findings advocate a shift toward more rigorous, generalizable evaluation standards in CL to ensure robust and scalable methods, and outline future directions including extending GTEP to online CL and other domains while pursuing more sample-efficient hyperparameter tuning. $D^{HT}$, $D^{E}$, $P^{HT}$, and $P^{E}$ are central quantities, reinforcing the emphasis on cross-scenario generalization in CL.

Abstract

Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The dominant conventional evaluation protocol for CL algorithms selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP), consisting of a hyperparameter tuning phase and an evaluation phase. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol. Our implementation is available at https://github.com/csm9493/GTEP.
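The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see their repository for that): the names `grid`, `tune`, `gtep`, and `toy_run_scenario` are hypothetical, the "datasets" are toy dictionaries, and the scoring function is an assumed stand-in for training a CL algorithm on a full scenario and measuring its accuracy.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Tuple

Hyperparams = Dict[str, float]
Dataset = Dict[str, float]  # toy stand-in for a benchmark dataset (assumption)
RunScenario = Callable[[Dataset, Hyperparams], float]


def grid(space: Dict[str, List[float]]) -> Iterator[Hyperparams]:
    """Enumerate every combination in a small hyperparameter grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def tune(run_scenario: RunScenario, dataset: Dataset,
         candidates: List[Hyperparams]) -> Tuple[Hyperparams, float]:
    """Phase 1: keep the hyperparameters that score best on the tuning scenario."""
    scored = [(run_scenario(dataset, hp), hp) for hp in candidates]
    best_score, best_hp = max(scored, key=lambda pair: pair[0])
    return best_hp, best_score


def gtep(run_scenario: RunScenario, d_ht: Dataset, d_e: Dataset,
         candidates: List[Hyperparams]) -> Dict[str, object]:
    """Tune on the scenario built from d_ht, then evaluate once on the one built from d_e."""
    best_hp, tuning_score = tune(run_scenario, d_ht, candidates)
    # Phase 2: no further tuning; the selected hyperparameters are applied as-is.
    eval_score = run_scenario(d_e, best_hp)
    return {"best_hp": best_hp, "tuning_score": tuning_score, "eval_score": eval_score}


def toy_run_scenario(dataset: Dataset, hp: Hyperparams) -> float:
    """Hypothetical scorer: peaks when hp matches a dataset-specific optimum."""
    return (1.0
            - abs(hp["lr"] - dataset["best_lr"])
            - 0.1 * abs(hp["reg"] - dataset["best_reg"]))


# Same scenario configuration, different underlying datasets (toy values).
D_HT: Dataset = {"best_lr": 0.1, "best_reg": 1.0}
D_E: Dataset = {"best_lr": 0.3, "best_reg": 1.0}
SPACE = {"lr": [0.1, 0.2, 0.3], "reg": [0.5, 1.0]}

result = gtep(toy_run_scenario, D_HT, D_E, list(grid(SPACE)))
print(result)
```

In this toy setting the tuning-phase score corresponds to what the conventional protocol would report, while the evaluation-phase score is what GTEP reports. The gap between the two is the kind of overestimation the abstract refers to.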

Paper Structure (19 sections, 21 figures, 22 tables, 3 algorithms)

This paper contains 19 sections, 21 figures, 22 tables, 3 algorithms.

Figures (21)

  • Figure 1: Illustration of the conventional evaluation protocol. First, a CL scenario is constructed using a benchmark dataset, where each task has its own training, validation, and test sets. Second, to find the best hyperparameters $\mathcal{H}^*$, a model is sequentially trained up to the final task using sampled hyperparameters. After training for each task $t$, the model $\theta_t$ is evaluated using the validation dataset. This process is repeated across various hyperparameter settings, and the best hyperparameters $\mathcal{H}^*$ are selected based on a performance metric. Finally, a new model is trained using the CL algorithm with the best hyperparameters $\mathcal{H}^*$ in the same CL scenario, and the evaluation result on the test dataset is reported. Note that in most studies, results are reported using only $D_{\text{val}}$, without a separate test set (i.e., $D_{\text{te}} = D_{\text{val}}$) [zhou2023pycil, sun2023pilot].
  • Figure 2: Results on both phases.
  • Figure 3: Illustration of the proposed evaluation protocol. Both phases share the same CL scenario configuration (e.g., the number of tasks and the number of classes in each task), but they are generated from distinct datasets ($D^{HT}$ and $D^{E}$). The best hyperparameters are selected in the hyperparameter tuning phase. The evaluation phase then assesses a CL algorithm by training a model with these hyperparameters. Note that evaluating an algorithm solely on the results of the hyperparameter tuning phase, without using $D^{E}$, is identical to the conventional evaluation protocol.
  • Figure 4: Experimental results (AvgAcc) on the 10 Tasks scenario using ImageNet-100-1 for $D^{HT}$ and ImageNet-100-2 for $D^{E}$ (high similarity). The terms Original and $\mathcal{H}^*$ refer to the use of reported hyperparameters and hyperparameters selected by our protocol, respectively. BEEF consistently returns NaN in the training loss for specific seeds, so we could not report its performance.
  • Figure 5: Bar graphs depicting the experimental results from the evaluation phase. The Y-axis represents final classification accuracy (Acc). The parentheses next to each algorithm indicate the publication year. The bar graphs in the first row show the experimental results using the best hyperparameters selected in the hyperparameter tuning phase with $D^{HT} = \text{CIFAR-50-1}$, while the graphs in the second row show the results using $D^{HT} = \text{ImageNet-50-1}$. When using ImageNet-50-1 or ImageNet-50-2, we had difficulty obtaining results for BEEF because of NaN issues.
  • ...and 16 more figures
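
The selection loop in the Figure 1 caption (sequential training, validation after each task, choosing $\mathcal{H}^*$ by a performance metric) might be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `run_scenario`, `select_best`, `toy_train`, and `toy_evaluate` are hypothetical names, the "model" is a dictionary of per-task accuracies, the single hyperparameter is an assumed retention factor, and the metric (mean over steps of the average validation accuracy on tasks seen so far) is one possible choice that the caption does not specify.

```python
from typing import Callable, Dict, List, Sequence, Tuple

ToyModel = Dict[int, float]  # task id -> validation accuracy (toy assumption)


def run_scenario(tasks: Sequence[int], hp: float,
                 train_on_task: Callable[[ToyModel, int, float], ToyModel],
                 evaluate: Callable[[ToyModel, int], float]) -> List[float]:
    """Train sequentially; after each task t, validate theta_t on the tasks seen so far."""
    model: ToyModel = {}
    per_step: List[float] = []
    for t, task in enumerate(tasks):
        model = train_on_task(model, task, hp)
        seen = tasks[: t + 1]
        per_step.append(sum(evaluate(model, s) for s in seen) / len(seen))
    return per_step


def select_best(tasks: Sequence[int], candidates: Sequence[float],
                train_on_task: Callable[[ToyModel, int, float], ToyModel],
                evaluate: Callable[[ToyModel, int], float]) -> Tuple[float, float]:
    """Repeat the scenario for each candidate and keep the best-scoring one."""
    best_hp, best_metric = candidates[0], float("-inf")
    for hp in candidates:
        per_step = run_scenario(tasks, hp, train_on_task, evaluate)
        metric = sum(per_step) / len(per_step)  # assumed aggregate metric
        if metric > best_metric:
            best_hp, best_metric = hp, metric
    return best_hp, best_metric


def toy_train(model: ToyModel, task: int, retention: float) -> ToyModel:
    """Toy trade-off: higher retention keeps old tasks but learns new ones less well."""
    updated = {seen: acc * retention for seen, acc in model.items()}
    updated[task] = 1.0 - 0.2 * retention
    return updated


def toy_evaluate(model: ToyModel, task: int) -> float:
    """Look up the toy validation accuracy for a task (0.0 if never trained)."""
    return model.get(task, 0.0)


best_hp, best_metric = select_best([0, 1, 2], [0.0, 0.5, 1.0], toy_train, toy_evaluate)
print(best_hp, best_metric)
```

In the conventional protocol the selected value would then be used to retrain in the same scenario and report the result, which is why the reported number reflects tuning on the very scenario being evaluated.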