Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention
Vaibhav Singh, Rahaf Aljundi, Eugene Belilovsky
TL;DR
DoSAPP addresses forgetting in class-incremental continual learning (CL) for Vision-Language Models by interleaving supervised adaptation with unsupervised test-time retention, without any replay buffer. The two-phase framework builds on a Teacher-Student exponential moving average (EMA) scheme, using gradient-based sparse updates and affine-projected dual momentum to balance plasticity and stability; the TTL phase uses pseudo-labels and a union of masks to stabilize learning. Across five datasets, DoSAPP achieves state-of-the-art average accuracy and minimal forgetting relative to both CL and continual test-time adaptation (CTTA) baselines, including when task boundaries are unknown. The result is a memory-free, scalable mechanism for using unlabeled test-time data to preserve prior knowledge in dynamic deployment scenarios, along with practical considerations and potential extensions to other modalities.
Abstract
Foundational Vision-Language Models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior-task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily targets domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple Teacher-Student framework with gradient-based sparse parameter updates and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.
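To make the Teacher-Student idea in the abstract concrete, the following is a minimal sketch of one generic update step: a sparse mask keeps only the parameters with the largest gradient magnitude, the student takes a masked gradient step, and the teacher follows the student through an exponential moving average. This is an illustration under stated assumptions, not the DoSAPP algorithm itself: the top-k selection rule, the function names (`topk_mask`, `sparse_student_step`, `ema_teacher_step`), and the values of `sparsity`, `lr`, and `momentum` are all hypothetical, and the sketch leaves out the affine-projected dual momentum, pseudo-labeling, and union of masks.

```python
# Illustrative sketch only: a generic sparse-update + EMA teacher step.
# All names and hyperparameters here are assumptions, not taken from the paper.

def topk_mask(grads, sparsity):
    """Return a 0/1 mask selecting the `sparsity` fraction of entries with largest |gradient|."""
    k = max(1, int(round(sparsity * len(grads))))
    # Indices of the k largest-magnitude gradients.
    top = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)[:k]
    chosen = set(top)
    return [1.0 if i in chosen else 0.0 for i in range(len(grads))]


def sparse_student_step(student, grads, mask, lr):
    """Plain gradient step applied only where mask == 1; other weights stay frozen."""
    return [w - lr * m * g for w, g, m in zip(student, grads, mask)]


def ema_teacher_step(teacher, student, momentum):
    """Teacher tracks the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Toy example with four scalar "parameters".
student = [1.0, 2.0, 3.0, 4.0]
teacher = list(student)
grads = [0.1, -2.0, 0.05, 1.0]

mask = topk_mask(grads, sparsity=0.5)              # keeps the two largest |g|
student = sparse_student_step(student, grads, mask, lr=0.1)
teacher = ema_teacher_step(teacher, student, momentum=0.9)
```

In this toy run only the second and fourth parameters move, and the teacher shifts a tenth of the way toward the updated student. A high momentum keeps the teacher close to earlier weights (stability), while the sparse mask limits how much of the student each step can change.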
