Forgetting, Ignorance or Myopia: Revisiting Key Challenges in Online Continual Learning

Xinrui Wang, Chuanxing Geng, Wenhai Wan, Shao-yuan Li, Songcan Chen

TL;DR

The paper proposes the Non-sparse Classifier Evolution framework (NsCE), which combines non-sparse maximum separation regularization and targeted experience replay with pre-trained models, so that new globally discriminative features can be acquired rapidly and at minimal time cost.

Abstract

Online continual learning (OCL) requires models to learn from constant, endless streams of data. While significant efforts have been made in this field, most have focused on mitigating catastrophic forgetting to achieve better classification ability, at the cost of a much heavier training workload. They overlook that in real-world scenarios, e.g., in high-speed data stream environments, data do not pause to accommodate slow models. In this paper, we emphasize that model throughput -- defined as the maximum number of training samples that a model can process within a unit of time -- is equally important. It directly limits how much data a model can utilize and presents a challenging dilemma for current methods. With this understanding, we revisit key challenges in OCL from both empirical and theoretical perspectives, highlighting two critical issues beyond the well-documented catastrophic forgetting. (1) Model's ignorance: the single-pass nature of OCL challenges models to learn effective features within constrained training time and storage capacity, leading to a trade-off between effective learning and model throughput. (2) Model's myopia: the local learning nature of OCL on the current task leads the model to adopt overly simplified, task-specific features and an excessively sparse classifier, resulting in a gap between the optimal solution for the current task and the global objective. To tackle these issues, we propose the Non-sparse Classifier Evolution framework (NsCE) to facilitate effective global discriminative feature learning with minimal time cost. NsCE integrates non-sparse maximum separation regularization and targeted experience replay techniques with the help of pre-trained models, enabling rapid acquisition of new globally discriminative features.
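As a rough illustration of the throughput definition in the abstract, the sketch below times a training loop and reports samples processed per second. The names `measure_throughput`, `train_step`, and `dummy_step` are hypothetical and do not come from the paper; a timed run like this only estimates the maximum rate, since it depends on hardware and batch size.

```python
import time


def measure_throughput(train_step, batches):
    """Estimate throughput as training samples processed per second.

    `train_step` is a hypothetical callable that performs one update on a batch.
    """
    seen = 0
    start = time.perf_counter()
    for batch in batches:
        train_step(batch)
        seen += len(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return seen / elapsed if elapsed > 0 else float("inf")


def dummy_step(batch):
    """Stand-in for a real gradient update (assumption for illustration)."""
    return sum(x * x for x in batch)


if __name__ == "__main__":
    batches = [[float(i)] * 32 for i in range(100)]
    print(f"{measure_throughput(dummy_step, batches):.0f} samples/s")
```

A heavier `train_step` (for example, one that adds replay or distillation) lowers the measured value, which is the cost the abstract refers to.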

Paper Structure (21 sections, 3 theorems, 13 equations, 15 figures, 12 tables, 1 algorithm)

This paper contains 21 sections, 3 theorems, 13 equations, 15 figures, 12 tables, 1 algorithm.

Key Result

Theorem 4.1

For any distributions $\mu_1,\dots,\mu_T$ over $\mathcal{Z}$, let $\mathcal{D}_t$ be an i.i.d. set of $m_t=\min(v_s, v_m)\Delta_t$ samples drawn from $\mu_t$, serving as the dataset of task $t$. Then, for any $\lambda>0$ and any online predictive sequence $(Q_0, Q_1, \dots, Q_T)$, the following inequality holds with
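The sample count $m_t=\min(v_s, v_m)\Delta_t$ in the theorem statement above can be made concrete with a small calculation. This excerpt does not define the symbols, so the sketch assumes $v_s$ is the stream speed, $v_m$ is the model throughput (both in samples per unit time), and $\Delta_t$ is the duration of task $t$. The function name and the numbers are illustrative only.

```python
def samples_per_task(stream_speed, model_throughput, durations):
    """Compute m_t = min(v_s, v_m) * Delta_t for each task t.

    Assumed reading (not defined in this excerpt):
      stream_speed      -> v_s, samples arriving per unit time
      model_throughput  -> v_m, samples the model can train on per unit time
      durations         -> Delta_t, length of each task in the same time unit
    """
    usable_rate = min(stream_speed, model_throughput)
    return [usable_rate * d for d in durations]


if __name__ == "__main__":
    # A slow model (200 samples/s) on a fast stream (500 samples/s)
    # can use only 200 samples per second of each task.
    print(samples_per_task(500.0, 200.0, [10.0, 20.0]))
```

Under this reading, raising model throughput increases $m_t$ only until it matches the stream speed; beyond that point the stream itself is the bottleneck.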

Figures (15)

  • Figure 1: Real-time accuracy of OCL models trained under the standard cross-entropy loss $L_{ce}$, both with and without pre-trained models (pre-trained on ImageNet), under our designed single-task setting, and the impact of some commonly used strategies \cite{chaudhry2019tiny, mai2021supervised}. Results on additional datasets, the influence of different pre-trained models (pre-trained on different datasets, using different backbones and different pre-training tasks), and implementation details are provided in Appendix \ref{appendix: pretrain}.
  • Figure 2: Left: Throughput of the model trained using vanilla cross-entropy, experience replay \cite{chaudhry2019tiny}, supervised contrastive replay \cite{mai2021supervised}, and distillation chain \cite{agrawal2024taming}. Right: Performance ($A_{AUC}$: Area Under the Curve of Accuracy) and running time of the above strategies on CIFAR10. "CE++" denotes that we compute and perform extra gradient descent steps per time step to match the delay of the compared-against strategies. All experiments are conducted under the single-task setting.
  • Figure 3: Normalized confusion matrices of the NCM classifier (green) and the softmax classifier (blue) on CIFAR10 with ImageNet supervised pre-trained initialization. Due to space limitations, we present only part of the training process in the main text. The complete training process is given in Appendix \ref{forgetting}.
  • Figure 4: Left: averaged weights of the final FC layer for class 0 in CIFAR10. Right: $s(w)$ of the final FC layer weights $w^0$ corresponding to class 0 in CIFAR10 (a lower $s(w)$ indicates greater sparsity). During the training of task 5, class confusion occurs as shown in Figure \ref{fig: part linear proto confusion}, where the model classifies "car" as "truck".
  • Figure 5: Detailed evolution of the normalized confusion matrix (CIFAR10) under our proposed NsCE framework (memory buffer size is 100 and replay frequency is 1/100).
  • ...and 10 more figures
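
Figure 3 compares a softmax classifier with an NCM (nearest class mean) classifier. For readers unfamiliar with NCM, the following is a minimal generic sketch, not the paper's implementation: it averages feature vectors per class and assigns a query to the class with the closest mean under Euclidean distance. The function names and the toy 2-D features are assumptions for illustration; in practice the features would come from the backbone network.

```python
import math


def class_means(features, labels):
    """Average the feature vectors of each class (the class prototypes)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def ncm_predict(x, means):
    """Return the label whose class mean is nearest to x (Euclidean distance)."""
    return min(means, key=lambda y: math.dist(x, means[y]))


if __name__ == "__main__":
    feats = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
    labels = [0, 0, 1, 1]
    means = class_means(feats, labels)
    print(ncm_predict([1.0, 1.0], means), ncm_predict([4.0, 5.0], means))
```

Because the prediction depends only on class means in feature space, an NCM classifier has no final FC layer weights, which is why it serves as a contrast to the softmax classifier in the figure.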

Theorems & Definitions (7)

  • Theorem 4.1
  • Lemma B.1: Adapted from \cite{alquier2016properties}, Thm 4.1
  • Remark B.2
  • Definition B.3: Stochastic kernels \cite{rivasplata2020pac}
  • Definition B.4
  • Theorem B.5
  • proof