Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

Yu-Chu Yu; Chi-Pin Huang; Jr-Jen Chen; Kai-Po Chang; Yung-Hsuan Lai; Fu-En Yang; Yu-Chiang Frank Wang

TL;DR

This work tackles catastrophic forgetting and zero-shot degradation when fine-tuning large vision-language models across sequential tasks. It introduces Selective Dual-Teacher Knowledge Transfer, which employs both the most recently fine-tuned model $g_{k-1}$ and the original pre-trained model $g_0$ as dual teachers, selecting the appropriate teacher for each reference image via a dual-teacher discrepancy and a sigmoid-based selection score. The training objective combines standard cross-entropy with a weighted dual knowledge-distillation (KD) term, enabling continual learning while preserving zero-shot capabilities without accessing data from previous tasks. Empirical results on eight fine-grained datasets and the MTIL/MCIL benchmarks show substantial improvements over state-of-the-art continual learning methods, with reduced forgetting and robust open-vocabulary transfer, albeit with limitations tied to the reference data distribution.
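
As a rough sketch of the objective described above, the loss for task $k$ could be written as follows. The student symbol $g_k$, the weight $\lambda$, the selection score $\alpha(x)$, the distillation loss $\mathcal{L}_{\mathrm{KD}}$, and the reference set $\mathcal{X}_{\mathrm{ref}}$ are assumed notation introduced here for illustration; the exact form used in the paper may differ.

$$
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \, \mathbb{E}_{x \in \mathcal{X}_{\mathrm{ref}}} \Big[ \alpha(x)\, \mathcal{L}_{\mathrm{KD}}(g_k, g_{k-1}; x) + \big(1 - \alpha(x)\big)\, \mathcal{L}_{\mathrm{KD}}(g_k, g_0; x) \Big]
$$

Here $\alpha(x) \in (0, 1)$ stands for the sigmoid-based selection score computed from the dual-teacher discrepancy on reference image $x$: a score near 1 weights distillation toward $g_{k-1}$, and a score near 0 weights it toward $g_0$.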

Abstract

Large-scale vision-language models (VLMs) have shown a strong zero-shot generalization capability on unseen-domain data. However, adapting pre-trained VLMs to a sequence of downstream tasks often leads to the forgetting of previously learned knowledge and a reduction in zero-shot classification performance. To tackle this problem, we propose a unique Selective Dual-Teacher Knowledge Transfer framework that leverages the most recently fine-tuned and the original pre-trained VLMs as dual teachers to preserve the previously learned knowledge and zero-shot capabilities, respectively. With access only to an unlabeled reference dataset, our proposed framework performs selective knowledge distillation by measuring the feature discrepancy between the dual-teacher VLMs. Consequently, our selective dual-teacher knowledge distillation mitigates catastrophic forgetting of previously learned knowledge while preserving the zero-shot capabilities of pre-trained VLMs. Extensive experiments on benchmark datasets demonstrate that our framework performs favorably against state-of-the-art continual learning approaches in preventing catastrophic forgetting and zero-shot degradation. Project page: https://chuyu.org/research/snd
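
The selective mechanism can be illustrated with a minimal sketch, assuming that features are plain vectors, that the discrepancy is a Euclidean distance, and that the selection score is a shifted sigmoid. The function names and the `threshold` and `sharpness` hyperparameters are hypothetical and are not taken from the paper; only the direction of the selection (larger discrepancy favours the most recently fine-tuned teacher, smaller discrepancy favours the pre-trained teacher) follows the description here.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def selection_score(d, threshold=1.0, sharpness=5.0):
    """Sigmoid-based score in (0, 1) computed from the discrepancy d.

    Scores near 1 favour the most recently fine-tuned teacher;
    scores near 0 favour the pre-trained teacher.
    threshold and sharpness are hypothetical hyperparameters.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (d - threshold)))


def selective_kd_loss(f_student, f_prev, f_pre, threshold=1.0, sharpness=5.0):
    """Weighted dual-teacher distillation loss for one reference image.

    f_student: student features (assumed to come from the model being trained)
    f_prev:    features from the most recently fine-tuned teacher
    f_pre:     features from the original pre-trained teacher
    Returns (discrepancy, selection score, loss).
    """
    # Dual-teacher discrepancy: how far the two teachers disagree on this image.
    d = l2_distance(f_prev, f_pre)
    alpha = selection_score(d, threshold, sharpness)
    # Distill mostly from the fine-tuned teacher when alpha is large,
    # and mostly from the pre-trained teacher when alpha is small.
    loss = (alpha * l2_distance(f_student, f_prev)
            + (1.0 - alpha) * l2_distance(f_student, f_pre))
    return d, alpha, loss


if __name__ == "__main__":
    # Teachers agree: the score stays below 0.5, so the pre-trained teacher dominates.
    print(selective_kd_loss([0.9, 0.1], [1.0, 0.0], [1.0, 0.0]))
    # Teachers disagree strongly: the score approaches 1, so the fine-tuned teacher dominates.
    print(selective_kd_loss([0.0, 2.5], [0.0, 3.0], [0.0, 0.0]))
```

In a full training loop, a term like this would be averaged over a batch of unlabeled reference images and added to the cross-entropy loss on the current task's data.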

Paper Structure (43 sections, 8 equations, 9 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Rehearsal-Based Continual Learning.
Data-Free Continual Learning.
Continual Learning on Vision-Language Models.
Method
Problem Formulation
Selective Dual-Teacher Knowledge Transfer on VLMs
Dual-Teacher Discrepancy for Teacher Selection.
Selective Knowledge Distillation from Dual-Teachers.
Training and Inference
Training Phase.
Inference Phase.
Experiment
Implementation Detail
...and 28 more sections

Figures (9)

Figure 1: Compared with standard fine-tuning models, our Selective Dual-Teacher Knowledge Transfer advances continual learning to mitigate catastrophic forgetting on previously fine-tuned tasks, while preserving the model's zero-shot capability.
Figure 2: (a) The overall architecture of our proposed Selective Dual-Teacher Knowledge Transfer framework. (b) Selective knowledge transfer from $g_{k-1}$ due to larger discrepancy $d$ between dual teachers $g_0$ and $g_{k-1}$, alleviating catastrophic forgetting on Task $k-1$. (c) Selective knowledge transfer from $g_{0}$ due to smaller discrepancy $d$ between dual teachers $g_0$ and $g_{k-1}$, preserving the zero-shot capability of $g_0$.
Figure 3: Illustration of training and evaluation schemes for continual learning. From the top row to the bottom row, the pre-trained model $g_0$ is incrementally fine-tuned on different tasks (in green). For the incrementally learned model $g$ in each row, data from unseen tasks are shown in red, while data from previously fine-tuned tasks are shown in blue.
Figure 4: Assessment of catastrophic forgetting with Aircraft (left) and Pets (right) as the first task in the continual learning sequence (i.e., the horizontal axis). Our method maintains accuracy on these first tasks through the end of the learning sequence.
Figure 5: Assessment of zero-shot degradation with UCF101 (left) and Food (right) as the last task in the continual learning sequence (i.e., the horizontal axis). Our method shows satisfactory accuracy on these tasks before fine-tuning on the last task.
...and 4 more figures
