Online Distillation with Continual Learning for Cyclic Domain Shifts

Joachim Houyon; Anthony Cioppa; Yasir Ghunaim; Motasem Alfarra; Anaïs Halin; Maxim Henry; Bernard Ghanem; Marc Van Droogenbroeck

TL;DR

The paper tackles catastrophic forgetting in online distillation under cyclic domain shifts by integrating continual learning methods (both replay-based and regularization-based) into the online distillation pipeline. It defines a cyclic online continual learning setting and evaluates it on long, untrimmed video streams in which a fast student learns from a slow, accurate teacher via pseudo-labels. The authors show that replay-based methods, particularly MIR and MIR+RWalk, substantially mitigate forgetting and improve both backward and forward transfer, while some regularizers can hinder online adaptation. Overall, the approach makes real-time perception more robust to cyclic domain shifts in applications such as autonomous driving and video surveillance.

Abstract

In recent years, online distillation has emerged as a powerful technique for adapting real-time deep neural networks on the fly using a slow, but accurate teacher model. However, a major challenge in online distillation is catastrophic forgetting when the domain shifts, which occurs when the student model is updated with data from the new domain and forgets previously learned knowledge. In this paper, we propose a solution to this issue by leveraging the power of continual learning methods to reduce the impact of domain shifts. Specifically, we integrate several state-of-the-art continual learning methods in the context of online distillation and demonstrate their effectiveness in reducing catastrophic forgetting. Furthermore, we provide a detailed analysis of our proposed solution in the case of cyclic domain shifts. Our experimental results demonstrate the efficacy of our approach in improving the robustness and accuracy of online distillation, with potential applications in domains such as video surveillance or autonomous driving. Overall, our work represents an important step forward in the field of online distillation and continual learning, with the potential to significantly impact real-world applications.

Paper Structure (12 sections, 7 equations, 4 figures, 1 table)


Introduction
Related Work
Methodology
Online distillation framework
Replay-based methods
Regularization-based methods
Evaluation methodology
Experiments
Experimental setup
Quantitative results
Qualitative results
Conclusion

Figures (4)

Figure 1: Online distillation with continual learning. When cyclic domain shifts occur in long videos, the online distillation framework proposed by Cioppa et al. [Cioppa2019ARTHuS] forgets the previously acquired knowledge as it fine-tunes on the current domain. In this work, we study the inclusion of state-of-the-art continual learning methods inside the online distillation framework to mitigate this catastrophic forgetting around the domain shifts.
Figure 2: Online distillation. The framework is composed of a fast and a slow route. In the fast route (inference), the video stream $\mathcal{V}$ is processed by a student network $\mathbf{S}$ on a task $\mathcal{T}$ (e.g., semantic segmentation for autonomous driving), which produces a prediction $\hat{y}_i$ for each frame $x_i$ of the video at the original video rate $r_\mathcal{V}$ (i.e., in real time). In parallel, in the slow route (training), a frozen teacher $\mathbf{T}$ produces pseudo ground-truths $\tilde{y}_{i'}$ from a subset of frames $x_{i'}$ at a slower rate $r_\mathbf{T}$. Each pair $(x_{i'},\tilde{y}_{i'})$ is then stored in an online dataset (or replay buffer) $\mathcal{D}$ through an update function $f_U$. $\mathcal{D}$ is sampled through a selection function $f_S$, and the selected pairs $\{(x_{n},\tilde{y}_{n})\}$ are used to train a copy of the student network $\mathbf{S}_c$ for one epoch using a loss $\mathcal{L}$. The parameters $\theta$ of $\mathbf{S}_c$ are then transferred to $\mathbf{S}$ at a rate $r_{\mathbf{S}_c}$ (corresponding to the inverse of the training time of $\mathbf{S}_c$ on one epoch) so that $\mathbf{S}$ improves on the latest domain of $\mathcal{V}$. One of the contributions of our paper consists in including replay-based Continual Learning (CL) methods, $CL_{Rep}$, inside $\mathcal{D}$ and regularization-based methods, $CL_{Reg}$, on $\mathcal{L}$.
Figure 3: Evolution of the performance over time. We compare the evolution of $mIoU$, BWT, Final-BWT, and FWT for the MIR+RWalk method and the original online distillation framework (baseline). (Top-left) $mIoU$: performance is mostly similar within a domain, but around the domain shifts (from the second cycle onward), the baseline suffers from forgetting while MIR+RWalk maintains high performance. (Bottom-left) BWT: when evaluated on the previous domain, MIR+RWalk clearly outperforms the baseline, showing that it retains information about the previous domain on frames it has trained on. (Top-right) Final-BWT: the baseline quickly forgets past knowledge, while MIR+RWalk retains high performance on both domains across many cycles. (Bottom-right) FWT: when evaluated on the future domain, MIR+RWalk also significantly outperforms the baseline, showing that it generalizes to new frames of a domain using information from earlier occurrences of that domain.
Figure 4: Qualitative results. Comparison of the segmentation masks obtained by different online continual learning methods: (top row) a frame taken right after the second transition between highway and downtown, and (bottom row) a frame taken right after the seventh transition between downtown and highway. The baseline method predicts poor segmentation masks after the domain shift, even though it has already seen this domain before. In contrast, MIR and MIR+RWalk produce better segmentation masks.
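
The two-route loop described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only a minimal illustration under strong assumptions, not the authors' implementation: the teacher is a hypothetical fixed function, the student is a toy linear model instead of a segmentation network, `f_update` is a simple first-in-first-out buffer, and `f_select` is uniform random sampling. All names (`teacher`, `Student`, `f_update`, `f_select`, `run`, `teacher_every`) are made up for the example.

```python
import random
from collections import deque


def teacher(x):
    """Hypothetical stand-in for the slow, frozen teacher T (pseudo-label)."""
    return 2.0 * x + 1.0


class Student:
    """Toy linear student y = w * x + b trained by SGD on squared error."""

    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def predict(self, x):
        return self.w * x + self.b

    def train_epoch(self, pairs, lr=0.05):
        # One pass over the selected (frame, pseudo-label) pairs.
        for x, y in pairs:
            err = self.predict(x) - y
            self.w -= lr * 2.0 * err * x
            self.b -= lr * 2.0 * err


def f_update(buffer, pair):
    """Update function f_U (assumed FIFO): the deque drops the oldest pair."""
    buffer.append(pair)


def f_select(buffer, k, rng):
    """Selection function f_S (assumed uniform sampling without replacement)."""
    return rng.sample(list(buffer), min(k, len(buffer)))


def run(stream, teacher_every=2, buffer_size=32, batch=16, seed=0):
    rng = random.Random(seed)
    buffer = deque(maxlen=buffer_size)            # online dataset D
    student, student_copy = Student(), Student()  # S and its training copy S_c
    predictions = []
    for i, x in enumerate(stream):
        # Fast route: the deployed student predicts on every frame.
        predictions.append(student.predict(x))
        # Slow route: the teacher only labels a subset of the frames.
        if i % teacher_every == 0:
            f_update(buffer, (x, teacher(x)))
            student_copy.train_epoch(f_select(buffer, batch, rng))
            # Transfer the parameters of S_c to S.
            student.w, student.b = student_copy.w, student_copy.b
    return student, predictions


if __name__ == "__main__":
    frames = [((i * 37) % 100) / 50.0 - 1.0 for i in range(400)]
    s, preds = run(frames)
    print(s.w, s.b)
```

In this sketch, a replay-based continual learning method would change what `f_update` keeps and what `f_select` returns, while a regularization-based method would add a penalty term to the loss minimized in `train_epoch`.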
