Consistency-Guided Asynchronous Contrastive Tuning for Few-Shot Class-Incremental Tuning of Foundation Models
Shuvendu Roy, Elham Dolatabadi, Arash Afkanpour, Ali Etemad
TL;DR
This work tackles updating vision foundation models to learn new classes from few examples without catastrophic forgetting. It introduces Consistency-guided Asynchronous Contrastive Tuning (CoACT), which combines asynchronous contrastive tuning with LoRA adapters and an EMA teacher, a two-stage controlled fine-tuning protocol, and a consistency-based regularizer for subsequent sessions. It also formalizes the FSCIT setting, in which no large base session is assumed. Across 16 diverse datasets, CoACT achieves state-of-the-art performance in both FSCIL (gains of up to 5.02%) and FSCIT (gains of up to 12.51%), with an average improvement of 2.47%, while remaining robust in low-shot regimes and reducing forgetting. Comprehensive ablations confirm the contribution of each component, and analyses show favorable efficiency compared to prior methods. The code is publicly released.
Abstract
We propose Consistency-guided Asynchronous Contrastive Tuning (CoACT), a novel method for continuously tuning foundation models to learn new classes in few-shot settings. CoACT consists of three key components: (i) asynchronous contrastive tuning, which learns new classes by including LoRA modules in the pre-trained encoder while enforcing consistency between two asynchronous encoders; (ii) controlled fine-tuning, which facilitates effective tuning of a subset of the foundation model; and (iii) consistency-guided incremental tuning, which enforces additional regularization during later sessions to reduce forgetting of the learned classes. We evaluate our proposed solution on Few-Shot Class-Incremental Learning (FSCIL) as well as a new and more challenging setup called Few-Shot Class-Incremental Tuning (FSCIT), which facilitates the continual tuning of vision foundation models to learn new classes with only a few samples per class. Unlike traditional FSCIL, FSCIT does not require a large in-distribution base session for initial fully supervised training prior to the incremental few-shot sessions. We conduct extensive evaluations across 16 diverse datasets, demonstrating the effectiveness of CoACT in both FSCIL and FSCIT setups. CoACT outperforms existing methods by up to 5.02% in FSCIL and up to 12.51% in FSCIT for individual datasets, with an average improvement of 2.47%. Furthermore, CoACT exhibits reduced forgetting and enhanced robustness in low-shot experiments. Detailed ablation and sensitivity studies highlight the contribution of each component of CoACT. We make our code publicly available at https://github.com/ShuvenduRoy/CoACT-FSCIL.
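To make the idea of two asynchronous encoders concrete, the following is a minimal sketch and not the released implementation. It assumes the slowly moving encoder is an exponential moving average (EMA) of the tuned one, as the TL;DR suggests, and it uses a cosine-distance consistency term. The function names, the momentum value, and the choice of cosine distance are illustrative assumptions; the abstract does not specify them.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Assumed EMA rule: teacher <- m * teacher + (1 - m) * student.

    `teacher` and `student` are flat lists of weights (hypothetical stand-ins
    for the parameters of the two encoders).
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def cosine_consistency_loss(student_emb, teacher_emb):
    """Assumed consistency term: 1 - cosine similarity of two embeddings.

    Returns 0 when the embeddings point in the same direction.
    """
    dot = sum(a * b for a, b in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(a * a for a in student_emb))
    norm_t = math.sqrt(sum(b * b for b in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)

# Toy usage: the teacher drifts slowly toward the student after each step.
teacher_w = [1.0, 0.0]
student_w = [0.0, 1.0]
teacher_w = ema_update(teacher_w, student_w, momentum=0.9)
loss = cosine_consistency_loss([1.0, 0.0], [0.0, 1.0])  # orthogonal embeddings
```

In this sketch, a higher momentum makes the teacher change more slowly, which is what makes the two encoders asynchronous; the consistency term would be added to the main training objective with a weight that is left unspecified here.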
