Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation

HongWei Yan, Liyuan Wang, Kaisheng Ma, Yi Zhong

TL;DR

The paper tackles Online Continual Learning (OCL), where models learn from one-pass data streams and must balance learning new tasks with preserving past knowledge. It introduces MOSE, a framework that orchestrates latent, multi-level expertise through Multi-Level Supervision (MLS) and Reverse Self-Distillation (RSD), enabling the model to learn hierarchical features and transfer knowledge across depth-wise experts. Empirical results on Split CIFAR-100 and Split Tiny-ImageNet show MOSE substantially outperforms state-of-the-art baselines, with significant gains in average accuracy and reduced forgetting; the MOE variant further amplifies these gains. By addressing the overfitting-underfitting dilemma with internal, task-aware distillation and multi-scale supervision, MOSE offers a scalable and efficient path toward robust online continual learning in dynamic environments.

Abstract

To accommodate real-world dynamics, artificial intelligence systems need to cope with sequentially arriving content in an online manner. Beyond regular Continual Learning (CL) attempting to address catastrophic forgetting with offline training of each task, Online Continual Learning (OCL) is a more challenging yet realistic setting that performs CL in a one-pass data stream. Current OCL methods primarily rely on memory replay of old training samples. However, a notable gap from CL to OCL stems from the additional overfitting-underfitting dilemma associated with the use of rehearsal buffers: the inadequate learning of new training samples (underfitting) and the repeated learning of a few old training samples (overfitting). To this end, we introduce a novel approach, Multi-level Online Sequential Experts (MOSE), which cultivates the model as stacked sub-experts, integrating multi-level supervision and reverse self-distillation. Supervision signals across multiple stages facilitate appropriate convergence of the new task while gathering various strengths from experts by knowledge distillation mitigates the performance decline of old tasks. MOSE demonstrates remarkable efficacy in learning new samples and preserving past knowledge through multi-level experts, thereby significantly advancing OCL performance over state-of-the-art baselines (e.g., up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).

Paper Structure

This paper contains 26 sections, 11 equations, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overfitting-Underfitting Dilemma. We show the impact of the number of training epochs on the test accuracy of task 1 and task 2 of the Split CIFAR-100 dataset, as well as the BOF value for joint training on the buffer of the last task. (a) shows the test accuracy of task 1 when the training of task 2 has just finished ($t=2$). Similarly, (b) and (c) show the test accuracy of task 2 and the average performance of the first two tasks at $t=2$. (d) shows the buffer overfitting problem across different epochs. Here, aug is a combination of three data augmentations used in SimCLR and OCM. For a fair comparison, we fix $M=1K$, batch size $B=10$, and buffer batch size $B^{\mathcal{M}}=64$.
  • Figure 2: Illustration of the proposed MOSE. For each training sample $(x,y)$, the input $x$ is augmented to another view $x^{\prime}$, and the two are concatenated for network training. MOSE includes multiple supervision signals (cross-entropy and supervised contrastive loss) injected at different network layers, plus reverse self-distillation from the shallower layers to the deepest one to integrate the knowledge of the experts.
  • Figure 3: Different Number of Experts. We divide the ResNet18 backbone into several components according to its block-wise structure and evaluate each configuration under four memory buffer sizes.
  • Figure 4: Overfitting-Underfitting Test. These two subfigures exhibit (a) the test accuracy of each new task $t$ when it is trained; and (b) the average BOF value of old tasks after learning each task.
  • Figure 5: Average Accuracy with Different Student Experts. We present the average test accuracy $\mathbb{E}_{i\leq t} a_{i, t}$ at each task $t$ during training. E1, E2, E3, and E4 denote the student experts used in our proposed RSD. (a), (b), and (c) show the accuracy of the corresponding student expert; (d), (e), and (f) show the accuracy of their MOE versions, computed from the output logits averaged across all experts.
  • ...and 6 more figures
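
The quantities named in the captions for Figures 2 and 5 can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names are hypothetical, the supervised contrastive term is omitted, and the use of a mean squared distance between features for reverse self-distillation is an assumption made for illustration. In a real training loop the shallower experts' features would also be excluded from the gradient so that only the deepest expert is pulled toward them.

```python
import math


def softmax(logits):
    # Numerically stable softmax over one sample's class logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    # Cross-entropy of one sample against its integer class label.
    return -math.log(softmax(logits)[label])


def multi_level_ce(expert_logits, label):
    # Multi-level supervision: one cross-entropy term per expert (depth level).
    return sum(cross_entropy(logits, label) for logits in expert_logits)


def reverse_distill(expert_feats):
    # Reverse self-distillation: shallower experts act as teachers and the
    # deepest expert is the student. The distance used here (mean squared
    # error) is an assumption, not taken from the paper.
    student = expert_feats[-1]
    loss = 0.0
    for teacher in expert_feats[:-1]:
        loss += sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    return loss


def moe_logits(expert_logits):
    # MOE prediction: average the output logits across all experts.
    n = len(expert_logits)
    return [sum(column) / n for column in zip(*expert_logits)]


def average_accuracy(acc_row):
    # Average of a_{i,t} over all tasks i <= t seen so far.
    return sum(acc_row) / len(acc_row)
```

For example, `moe_logits([[1.0, 3.0], [3.0, 1.0]])` averages two experts' logits class by class, and `average_accuracy` applied to the accuracies of tasks 1 through $t$ gives the value plotted per task in Figure 5.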