Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Vaibhav Singh; Rahaf Aljundi; Eugene Belilovsky

TL;DR

DoSAPP addresses forgetting in class-incremental continual learning for Vision-Language Models by interleaving supervised adaptation with unsupervised test-time retention, without any replay buffer. Its two-phase framework builds on a Teacher-Student pair linked by an exponential moving average (EMA), with gradient-based sparse updates and affine-projected dual momentum to balance plasticity and stability. During test-time learning (TTL), pseudo-labels and a union of masks stabilize adaptation. Across five datasets, DoSAPP reports state-of-the-art average accuracy and minimal forgetting compared with both CL and continual test-time adaptation (CTTA) baselines, including when task boundaries are unknown. The approach offers a memory-free, scalable way to use unlabeled test-time data to preserve prior knowledge in dynamic deployment scenarios, with potential extensions to other modalities.

Abstract

Foundational Vision-Language Models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple Teacher-Student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.
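
The Teacher-Student mechanism with gradient-based sparse updates can be illustrated with a small sketch. This is not the authors' implementation: the top-k absolute-gradient scoring, the function names `topk_mask` and `dual_momentum_ema`, and the exact form of the per-parameter momentum are assumptions made for illustration, and the paper's "Derivation for dual momentum" section is the authoritative source. Flat Python lists stand in for model tensors.

```python
# Illustrative sketch only: scoring rule, names, and momentum form are assumptions.

def topk_mask(grads, k):
    """Boolean mask marking the k entries with the largest |gradient| (assumed scoring)."""
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(grads))]


def dual_momentum_ema(teacher, student, mask, delta, gamma):
    """Smooth teacher parameters toward the student with a per-parameter momentum.

    Assumed affine map of the mask m in {0, 1}: alpha = delta + m * (gamma - delta),
    so unselected entries use delta and selected entries use gamma.
    """
    updated = []
    for t, s, m in zip(teacher, student, mask):
        alpha = delta + (gamma - delta) * (1.0 if m else 0.0)
        updated.append(alpha * t + (1.0 - alpha) * s)
    return updated


# Hypothetical toy values, not taken from the paper.
teacher = [1.0, 1.0, 1.0]
student = [0.0, 0.0, 0.0]
grads = [0.1, -0.9, 0.3]

mask = topk_mask(grads, k=1)
new_teacher = dual_momentum_ema(teacher, student, mask, delta=0.9, gamma=0.5)
```

With a lower momentum on the selected entries, the teacher follows the student more closely where the student was actually trained and stays nearly fixed elsewhere, which is one way to read the plasticity and stability trade-off the abstract describes.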

Paper Structure (26 sections, 13 equations, 3 figures, 16 tables, 1 algorithm)

Introduction
Related Work
Methodology
DoSAPP: Double Smoothing via Affine Projected Parameters
Experiments
Setup
Results
Comparison with CL Methods
Comparison with TTA+CL Methods
Class Incremental Long Sequence scenario with domain shift
Ablation Study
Discussion and Conclusion
Appendix / supplemental material
Derivation for dual momentum
Effect of Momentum ($\gamma, \lambda$) on Average Accuracy
...and 11 more sections

Figures (3)

Figure 1: An illustration of our proposed setting of Continual Learning with Interleaved Test Time Learning. After each supervised training session, the model is deployed to adapt in an unsupervised deployment phase, where it encounters data from both current and previously seen tasks. During this phase, the model adapts to the current task's classes while striving to preserve performance on earlier tasks, thereby mitigating forgetting.
Figure 2: DoSAPP employs a Teacher model $\mathcal{M}_T$ and a Student model $\mathcal{M}_S$. In the Supervised Continual Learning phase, $\mathcal{M}_S$ performs sparse parameter selection using a gradient-based scoring function $\mathcal{F}$, followed by training on the selected parameters $\boldsymbol{\theta}^{\textbf{m}} \in \boldsymbol{\theta}^S$. After each update, the $\mathcal{M}_T$ parameters $\boldsymbol{\theta}^T$ are updated through weighted exponential smoothing based on the affine projection of the boolean mask $\textbf{m}$, controlled by dual momentum terms $\delta$ and $\gamma$ for $\mathcal{M}_T$ and $\mathcal{M}_S$, respectively. In the unsupervised test-time learning phase, $\mathcal{M}_S$ adapts using pseudo-labels derived from a comparison of the $\mathcal{M}_T$ and $\mathcal{M}_S$ logits. $\mathcal{M}_T$ then undergoes weighted smoothing again, with momentum terms $\delta$ and $\lambda$ for $\mathcal{M}_T$ and $\mathcal{M}_S$, respectively (where $\gamma<\lambda<\delta$). This two-phase approach ensures generalization over previous knowledge while maintaining adaptability to new tasks.
Figure 3: Per-task forgetting matrices for the long-sequence CIL setting (Aircraft $\rightarrow$ Cars). Each heatmap shows $F_{t,i} = R_{i,i}- R_{t,i}$, i.e., how much performance on task $i$ is lost after learning later tasks. The vertical and horizontal black lines denote the domain shift from Aircraft (left/top) to Cars (right/bottom). DoSAPP achieves the lowest forgetting across the entire task sequence, with several tasks even exhibiting negative forgetting, indicating improved retention as training progresses. In contrast, SPU shows moderate degradation, while standard finetuning undergoes severe catastrophic forgetting in both domains.
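
The forgetting measure in the Figure 3 caption, $F_{t,i} = R_{i,i} - R_{t,i}$, can be computed from an accuracy matrix as in the sketch below. It assumes that `R[t][i]` holds the accuracy on task `i` after training through task `t` (the caption does not spell out this indexing), and the accuracy values are made up for illustration rather than taken from the paper.

```python
# Illustrative sketch: the indexing convention and the numbers are assumptions.

def forgetting_matrix(R):
    """Return F with F[t][i] = R[i][i] - R[t][i] for i <= t, and None elsewhere.

    Entries with i > t are left as None because task i has not been learned yet.
    """
    n = len(R)
    F = [[None] * n for _ in range(n)]
    for t in range(n):
        for i in range(t + 1):
            F[t][i] = R[i][i] - R[t][i]
    return F


# Hypothetical accuracies (percent) for three tasks; the upper triangle is unused.
R = [
    [80.0, 0.0, 0.0],
    [75.0, 70.0, 0.0],
    [82.0, 64.0, 90.0],
]

F = forgetting_matrix(R)
```

In this toy example `F[1][0]` is positive (accuracy on the first task dropped after the second), while `F[2][0]` is negative, the case the caption calls negative forgetting: accuracy on an earlier task ends up higher than it was right after that task was learned.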
