CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models

Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun, Kede Ma, Ying Wei

TL;DR

Problem: static CL benchmarks fail to reflect real-world dynamics and suffer from data contamination when evaluated with foundation models. Approach: CLDyB reframes CL evaluation as an infinite-state MDP and solves it with action-space reduction via greedy sampling and clustering, plus MCTS-guided task selection, demonstrating joint and per-method analyses on a ViT backbone. Key findings: CLDyB sequences are consistently harder than random orders and expose distinct strengths and weaknesses across nine SOTA CL methods, with potential to mitigate data contamination by focusing on challenging tasks and expanding the data pool with AI-generated data. Impact: CLDyB offers a dynamic, algorithm-aware benchmarking framework that better aligns CL progress with open-world performance and motivates future development of more robust, adaptive continual learning strategies.

Abstract

The advent of the foundation model era has sparked significant research interest in leveraging pre-trained representations for continual learning (CL), yielding a series of top-performing CL methods on standard evaluation benchmarks. Nonetheless, there are growing concerns regarding potential data contamination during the pre-training stage. Furthermore, standard evaluation benchmarks, which are typically static, fail to capture the complexities of real-world CL scenarios, resulting in saturated performance. To address these issues, we describe CL on dynamic benchmarks (CLDyB), a general computational framework based on Markov decision processes for evaluating CL methods reliably. CLDyB dynamically identifies inherently difficult and algorithm-dependent tasks for the given CL methods, and determines challenging task orders using Monte Carlo tree search. Leveraging CLDyB, we first conduct a joint evaluation of multiple state-of-the-art CL methods, leading to a set of commonly challenging and generalizable task sequences where existing CL methods tend to perform poorly. We then conduct separate evaluations of individual CL methods using CLDyB, discovering their respective strengths and weaknesses. The source code and generated task sequences are publicly accessible at https://github.com/szc12153/CLDyB.
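
As a rough illustration of how Monte Carlo tree search can pick the next task in a sequence, the following sketch runs a generic UCB-based MCTS over task orders. It is not the authors' implementation: `toy_reward`, the integer task ids, the exploration constant `c`, and the function `mcts_next_task` are all hypothetical stand-ins. In particular, `toy_reward` only mimics "difficulty" and does not train or evaluate any CL method.

```python
import math
import random


def toy_reward(sequence):
    """Hypothetical difficulty score (an assumption, not from the paper):
    larger jumps between consecutive task ids count as a harder sequence."""
    return sum(abs(a - b) for a, b in zip(sequence, sequence[1:]))


class Node:
    def __init__(self, sequence, parent=None):
        self.sequence = sequence  # tasks chosen so far (the state)
        self.parent = parent
        self.children = {}        # task id (action) -> child Node
        self.visits = 0
        self.value = 0.0          # sum of returns backed up through this node


def ucb(child, parent_visits, c=1.4):
    """Upper confidence bound used to trade off exploitation and exploration."""
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def mcts_next_task(prefix, candidates, horizon, n_iter=300, seed=0):
    """Return the candidate task whose subtree was visited most often."""
    rng = random.Random(seed)
    root = Node(list(prefix))
    for _ in range(n_iter):
        node = root
        # Selection: descend by UCB until a node with an untried task is found.
        while len(node.sequence) < horizon:
            actions = [t for t in candidates if t not in node.sequence]
            if not actions:
                break
            untried = [a for a in actions if a not in node.children]
            if untried:
                # Expansion: add one new child for an untried task.
                action = rng.choice(untried)
                node.children[action] = Node(node.sequence + [action], node)
                node = node.children[action]
                break
            parent = node
            node = max(parent.children.values(),
                       key=lambda ch: ucb(ch, parent.visits))
        # Rollout: complete the sequence with uniformly random unused tasks.
        seq = list(node.sequence)
        while len(seq) < horizon:
            pool = [t for t in candidates if t not in seq]
            if not pool:
                break
            seq.append(rng.choice(pool))
        ret = toy_reward(seq)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += ret
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)


next_task = mcts_next_task(prefix=[0], candidates=[0, 1, 2, 3, 4, 5], horizon=3)
print(next_task)
```

In this toy setting the search favors the task farthest from the current one, because that maximizes the stand-in difficulty score. In an actual dynamic benchmark the reward would instead come from evaluating the CL methods on the candidate sequence, which is far more expensive and is one reason a reduced candidate set matters.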

Paper Structure

This paper contains 25 sections, 7 equations, 23 figures, 4 tables, 3 algorithms.

Figures (23)

  • Figure 1: System diagram of the proposed CLDyB for dynamically constructing task sequences that challenge current CL methods. At time step $t$, CLDyB first performs candidate task set construction by greedy task sampling (see Eqn. (\ref{eq:ats})) and functional task clustering. This results in a reduced and clustered action space $\bar{\mathcal{A}}_t$, facilitating task evaluation. It then performs optimal task identification by maximizing the estimated current state value function (see Eqn. (\ref{eq:search_objective})) using MCTS, which consists of four steps: 1) expansion, 2) rollout simulation, 3) backpropagation, and 4) selection. The pseudocode for CLDyB can be found in Algorithm \ref{alg:pscode-cldy-pipeline} of the Appendix.
  • Figure 2: Joint evaluation of the five CL methods (represented by solid lines) using CLDyB with generalization to the other four CL methods (represented by dashed lines).
  • Figure 3: Multi-dimensional assessment of CL methods using CLDyB, with higher values on the axes representing better performance. The comparison highlights their respective strengths and weaknesses, especially how they handle the trade-offs across different dimensions.
  • Figure 4: Dendrograms of CL methods. (a) Acc trajectories on the commonly challenging CLDyB sequences. (b) Stack of flattened versions of the 2D task-to-task similarity matrices obtained on individually challenging CLDyB sequences. The task similarity values are normalized by $s(\cdot)$ in Eqn. (\ref{eq:ats}) to the range $[0,1]$ for improved visualization. Both dendrograms exhibit noticeable consistency in their hierarchical structures, reflecting commonality in CL methods.
  • Figure 5: Performance comparison of CL methods on upcoming tasks selected by CLDyB from the original and the augmented data pools. Additional diffusion-generated images are added to the data pool at time step $t=2$ (denoted by the dashed line).
  • ...and 18 more figures