Consistency-Guided Asynchronous Contrastive Tuning for Few-Shot Class-Incremental Tuning of Foundation Models

Shuvendu Roy, Elham Dolatabadi, Arash Afkanpour, Ali Etemad

TL;DR

This work tackles updating vision foundation models to learn new classes from few examples without catastrophic forgetting. It introduces Consistency-guided Asynchronous Contrastive Tuning (CoACT), comprising asynchronous contrastive tuning with LoRA adapters and an EMA teacher, a two-stage controlled fine-tuning protocol, and a consistency-based regularizer for subsequent sessions, formalized in the FSCIT setting where no large base session is assumed. Across 16 diverse datasets, CoACT achieves state-of-the-art performance in both FSCIL (up to 5.02% gains) and FSCIT (up to 12.51% gains) with an average improvement of 2.47%, while maintaining robustness in low-shot regimes and reducing forgetting. Comprehensive ablations confirm the contribution of each component, and analyses show favorable efficiency compared to prior methods, supported by public code release.

Abstract

We propose Consistency-guided Asynchronous Contrastive Tuning (CoACT), a novel method for continuously tuning foundation models to learn new classes in few-shot settings. CoACT consists of three key components: (i) asynchronous contrastive tuning, which learns new classes by including LoRA modules in the pre-trained encoder while enforcing consistency between two asynchronous encoders; (ii) controlled fine-tuning, which facilitates effective tuning of a subset of the foundation model; and (iii) consistency-guided incremental tuning, which enforces additional regularization during later sessions to reduce forgetting of the learned classes. We evaluate our proposed solution on Few-Shot Class-Incremental Learning (FSCIL) as well as a new and more challenging setup called Few-Shot Class-Incremental Tuning (FSCIT), which facilitates the continual tuning of vision foundation models to learn new classes with only a few samples per class. Unlike traditional FSCIL, FSCIT does not require a large in-distribution base session for initial fully supervised training prior to the incremental few-shot sessions. We conduct extensive evaluations across 16 diverse datasets, demonstrating the effectiveness of CoACT in both FSCIL and FSCIT setups. CoACT outperforms existing methods by up to 5.02% in FSCIL and up to 12.51% in FSCIT for individual datasets, with an average improvement of 2.47%. Furthermore, CoACT exhibits reduced forgetting and enhanced robustness in low-shot experiments. Detailed ablation and sensitivity studies highlight the contribution of each component of CoACT. We make our code publicly available at https://github.com/ShuvenduRoy/CoACT-FSCIL.
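The asynchronous pairing of a trainable student with a slowly updated EMA teacher can be illustrated with a small sketch. It is not the authors' implementation: the names `ema_update` and `consistency_loss`, the momentum value, and the choice of one minus cosine similarity as the consistency term are all assumptions made for illustration, and plain floats stand in for real parameter tensors.

```python
import math


def ema_update(teacher, student, momentum=0.999):
    """Return teacher weights moved toward the student: m * t + (1 - m) * s."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def consistency_loss(student_emb, teacher_emb):
    """Assumed consistency term: 1 - cosine similarity of two embeddings."""
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


# Toy scalar "weights"; a real encoder would hold one tensor per parameter name.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# The teacher lags behind the student, which is what makes the pair asynchronous.
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly, so its embeddings give the student a stable target while the student's LoRA modules are trained on the few available samples.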

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Performance comparison on FSCIT with a foundation model.
  • Figure 2: Performance comparison to existing methods on traditional FSCIL.
  • Figure 3: Illustration of CoACT. (Left) Training on the first incremental session with asynchronous contrastive tuning and controlled fine-tuning. The student encoder contains learnable LoRA modules, while the teacher is identical to the foundation model but updated as the EMA of the student. Controlled fine-tuning enables the tuning of a subset of the foundation model with reduced LR after certain epochs. (Right) Consistency-guided incremental tuning enforces consistency between the learnable student and the frozen encoder from the first session, providing additional regularization that prevents overfitting and forgetting.
  • Figure 4: (Left) Forgetting of learned classes in FSCIT. (Right) Accuracy breakdown into the first, remaining, and all sessions in the FSCIT setup.
  • Figure 5: Performance for different shots in FSCIT.
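
The controlled fine-tuning described in the Figure 3 caption, where a subset of the foundation model is tuned with a reduced learning rate after certain epochs, might be scheduled roughly as follows. This is a hedged sketch only: the function name `learning_rates`, the group names, and the values of `base_lr`, `unfreeze_epoch`, and `backbone_lr_scale` are hypothetical and are not taken from the paper.

```python
def learning_rates(epoch, base_lr=1e-3, unfreeze_epoch=5, backbone_lr_scale=0.1):
    """Return per-group learning rates for a given epoch.

    LoRA modules train from the first epoch at base_lr. The selected subset of
    foundation-model weights stays frozen (LR 0) until unfreeze_epoch and is
    then tuned with a reduced LR of base_lr * backbone_lr_scale.
    """
    if epoch >= unfreeze_epoch:
        backbone_lr = base_lr * backbone_lr_scale
    else:
        backbone_lr = 0.0
    return {"lora": base_lr, "backbone_subset": backbone_lr}


# Print the schedule around the (assumed) unfreezing point.
for epoch in (0, 4, 5, 9):
    print(epoch, learning_rates(epoch))
```

In a training loop, these values would be written into the optimizer's parameter groups at the start of each epoch, so the pre-trained weights move only after the LoRA modules have had some time to adapt.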