Rethinking Momentum Knowledge Distillation in Online Continual Learning

Nicolas Michel; Maorong Wang; Ling Xiao; Toshihiko Yamasaki

TL;DR

This work tackles Online Continual Learning (OCL) by leveraging Momentum Knowledge Distillation (MKD) with an evolving Exponential Moving Average (EMA) teacher to overcome KD-specific challenges in single-pass data streams. By integrating MKD with existing replay-based OCL methods and introducing a plasticity-stability control via the parameter $\alpha$ and a teacher-dependent weight $\lambda_{\alpha}$, the approach yields substantial accuracy gains (over $10$ percentage points on ImageNet100) and improves stability, backward transfer, and feature discrimination. The paper also provides detailed ablations and analyses of boundary conditions, showing MKD effectively handles both clear and blurry task boundaries and reduces several known issues in OCL such as task-recency bias and feature drift. The results demonstrate that KD, when rethought as MKD with an evolving teacher, becomes a central, efficient component for advancing OCL performance in a model- and architecture-agnostic manner.
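The mechanism summarized above can be illustrated with a minimal sketch. It assumes the teacher update is $\theta_{teacher} \leftarrow \alpha\,\theta_{student} + (1-\alpha)\,\theta_{teacher}$, which matches the description of small $\alpha$ giving a stable teacher and large $\alpha$ a plastic one. The temperature-softened KL distillation term, the function names, and the flat parameter lists are illustrative assumptions, not the authors' implementation; the repository linked in the abstract is the authoritative reference.

```python
import math

def ema_update(teacher, student, alpha):
    """Move each teacher parameter toward the student by a factor alpha.

    Small alpha -> teacher changes slowly (stable);
    large alpha -> teacher tracks the student closely (plastic).
    """
    return [alpha * s + (1.0 - alpha) * t for t, s in zip(teacher, student)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions (assumed loss form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def total_loss(base_loss, student_logits, teacher_logits, lam):
    """Baseline OCL loss plus the distillation term weighted by lam (lambda_alpha)."""
    return base_loss + lam * kd_loss(student_logits, teacher_logits)

# Toy usage: one EMA step followed by a combined loss on a single sample.
teacher_params = ema_update([0.0, 1.0], [1.0, 1.0], alpha=0.1)
loss = total_loss(1.0, [2.0, 0.5, -1.0], [1.5, 0.7, -0.8], lam=0.5)
```

In a training loop under these assumptions, `ema_update` would run after every student optimizer step, so no task-boundary information is needed to refresh the teacher.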

Abstract

Online Continual Learning (OCL) addresses the problem of training neural networks on a continuous data stream where multiple classification tasks emerge in sequence. In contrast to offline Continual Learning, data can be seen only once in OCL, which is a severe constraint. In this context, replay-based strategies have achieved impressive results, and most state-of-the-art approaches depend heavily on them. While Knowledge Distillation (KD) has been extensively used in offline Continual Learning, it remains under-exploited in OCL despite its high potential. In this paper, we analyze the challenges in applying KD to OCL and give empirical justifications. We introduce a direct yet effective methodology for applying Momentum Knowledge Distillation (MKD) to many flagship OCL methods and demonstrate its ability to enhance existing approaches. In addition to improving existing state-of-the-art accuracy by more than $10$ percentage points on ImageNet100, we shed light on MKD's internal mechanics and its impact during training in OCL. We argue that, similar to replay, MKD should be considered a central component of OCL. The code is available at \url{https://github.com/Nicolas1203/mkd_ocl}.

Paper Structure (51 sections, 3 equations, 12 figures, 10 tables, 1 algorithm)

Introduction
Related Work
KD in CL
KD in Offline CL
KD in Online CL
Blurry Task Boundaries
Evaluation Metrics
Challenges of KD in OCL
Teacher Quality
Teacher Quantity
Unknown Task Boundaries
Methodology
Motivations
Momentum Knowledge Distillation
Rethinking MKD
...and 36 more sections

Figures (12)

Figure 1: Overview of our MKD framework when applied to a baseline OCL method. In contrast to taking a snapshot at the end of each task, a dynamic teacher addresses the key obstacles in OCL: teacher quality, teacher quantity, and unknown task boundaries.
Figure 2: Illustration of the blurry boundary setting (bottom row) as opposed to the clear boundary setting (top row). Detecting a task change in the blurry setting is not trivial.
Figure 3: Impact of $\alpha$ on the plasticity-stability trade-off. Lower $\alpha$ values imply a stable teacher with high performance on old tasks. Higher $\alpha$ values imply a plastic teacher with high performance on new tasks.
Figure 4: Impact of $\lambda_\alpha$ and $\alpha$ on the final performance of ER on CIFAR100, M=5k, clear setting.
Figure 5: Relation between $\log{\alpha}$ and the best corresponding $\lambda_\alpha$ value, $\lambda_{best}$. The displayed relation is linear.
...and 7 more figures
