Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

Longxiang Tang, Zhuotao Tian, Kai Li, Chunming He, Hantao Zhou, Hengshuang Zhao, Xiu Li, Jiaya Jia

TL;DR

This work tackles Domain-Class Incremental Learning (DCIL) for vision-language models, aiming to prevent forward forgetting of pre-trained zero-shot knowledge while still adapting efficiently to new tasks. It introduces DIKI, which injects task-specific information into a frozen VLM backbone through a zero-initialized residual attention path and uses a distribution-aware calibration to modulate that injection for test data from unseen distributions. DIKI achieves state-of-the-art results on the MTIL benchmark in the DCIL setting while training only 0.86% as many parameters as the previous state-of-the-art approach, with substantially less training time and without relying on external data. The approach offers a practical, parameter-efficient solution for maintaining zero-shot generalization across diverse tasks and distributions.

Abstract

This study addresses the Domain-Class Incremental Learning problem, a realistic but challenging continual learning scenario where both the domain distribution and target classes vary across tasks. To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability. However, this incurs a new problem: the knowledge encoded in the pre-trained VLMs may be disturbed when adapting to new tasks, compromising their inherent zero-shot ability. Existing methods tackle it by tuning VLMs with knowledge distillation on extra datasets, which demands heavy computation overhead. To address this problem efficiently, we propose the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, retaining pre-trained knowledge of VLMs from a perspective of avoiding information interference. Specifically, we design a fully residual mechanism to infuse newly learned knowledge into a frozen backbone, while introducing minimal adverse impacts on pre-trained knowledge. Besides, this residual property enables our distribution-aware integration calibration scheme, explicitly controlling the information implantation process for test data from unseen distributions. Experiments demonstrate that our DIKI surpasses the current state-of-the-art approach using only 0.86% of the trained parameters and requiring substantially less training time. Code is available at: https://github.com/lloongx/DIKI .
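
To make the "fully residual mechanism" concrete, here is a minimal NumPy sketch, not the authors' implementation. It contrasts the prompt-prepending baseline described in the Figure 2 caption with a separate residual attention branch whose values start at zero. The single-head attention without projection matrices, the tensor sizes, and all variable names are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention (single head, no projections).
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v


rng = np.random.default_rng(0)
n_tokens, n_prompts, dim = 4, 2, 8  # hypothetical sizes

# Stand-in for token features inside the frozen backbone.
x = rng.normal(size=(n_tokens, dim))
frozen_out = attention(x, x, x)

# Prompt tuning (assumed form): prompts are prepended to keys and values,
# so they share one softmax with the original tokens and dilute their weights.
prompt_k = rng.normal(size=(n_prompts, dim))
prompt_v = rng.normal(size=(n_prompts, dim))
mixed_out = attention(x, np.vstack([prompt_k, x]), np.vstack([prompt_v, x]))

# Residual branch (assumed form): prompts get their own softmax, and the
# result is added to the untouched frozen output. Zero-initialized values
# make the branch contribute nothing before training.
res_k = rng.normal(size=(n_prompts, dim))
res_v = np.zeros((n_prompts, dim))
residual_out = frozen_out + attention(x, res_k, res_v)

print("residual branch preserves frozen output:",
      np.allclose(residual_out, frozen_out))
print("prepended prompts preserve frozen output:",
      np.allclose(mixed_out, frozen_out))
```

In this toy setup the residual branch leaves the frozen output unchanged at initialization, while the prepended prompts alter it immediately. That is the interference the abstract refers to.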

Paper Structure

This paper contains 10 sections, 10 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) The domain-class incremental learning setting, where both the data distribution and the classes vary across tasks. Two kinds of forgetting exist due to the integration of pre-trained CLIP. (b) The forward accuracy (i.e., zero-shot ability) and the number of trainable parameters for each method, with marker size representing computational complexity. (c) Existing methods either demand heavy computation or sacrifice pre-trained knowledge. Our approach effectively retains pre-trained knowledge within a parameter-efficient framework. More details are provided in the IKI section.
  • Figure 2: Illustration of the information interference issue in previous prompt tuning methods and our proposed DIKI. (a) The existing methods mix attention derived from the frozen backbone and prepended prompts, which can cause information loss and finally harm the zero-shot ability. (b) We design a zero-initialized residual attention mechanism, which injects new information with pre-trained knowledge untouched, to retain the vision-language models' zero-shot ability. Distribution-aware integration calibration is also introduced to further boost performance thanks to the residual property.
  • Figure 3: Transfer and Last scores (%) with different uniform initialization bounds for the residual attention parameters on the MTIL benchmark. A larger initialization value does not affect the final accuracy (Last score), but it can severely harm the model's zero-shot ability because it introduces random noise into the pre-trained model.
  • Figure 4: Demonstration of the effect of our distribution-aware integration calibration. We evaluate a model trained only on the first task of MTIL, on both the trained task and unseen tasks, with manually assigned calibration weights. Fixed larger weights maintain high accuracy on the trained task but lose zero-shot ability, and vice versa. Our DIKI tailors the weight to each sample at inference time.
  • Figure 5: Heatmap visualization comparisons. We employ Grad-CAM [selvaraju2017grad] to evaluate a model that has been trained only on Aircraft [maji2013fine], across the unseen datasets OxfordPet [parkhi2012cats], Flowers [nilsback2008automated] and Food-101 [bossard2014food]. The heatmaps show that commonly used prompt-based methods introduce noise into the model, resulting in forward forgetting and model degradation. Our DIKI implants new knowledge in a fully residual manner, optimizing the retention of pre-trained knowledge.
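
The trade-off in the Figure 4 caption (a fixed large weight helps the trained task, a fixed small weight protects zero-shot ability) can be illustrated with a small sketch of per-sample calibration. This is only an assumed form. The Gaussian-style score, the stored mean and standard deviation, and the function names `calibration_weight` and `integrate` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def calibration_weight(feature, mean, std):
    # Hypothetical score in (0, 1]: close to 1 when the feature matches the
    # stored training statistics, close to 0 when it is far from them.
    z = (feature - mean) / (std + 1e-8)
    return float(np.exp(-0.5 * np.mean(z ** 2)))


def integrate(frozen_out, residual_out, weight):
    # Residual integration: weight 0 recovers the frozen backbone output,
    # weight 1 applies the learned residual in full.
    return frozen_out + weight * residual_out


rng = np.random.default_rng(0)
dim = 16  # hypothetical feature size

# Feature statistics of a task the model was trained on (synthetic data).
train_feats = rng.normal(loc=2.0, scale=0.5, size=(200, dim))
mean, std = train_feats.mean(axis=0), train_feats.std(axis=0)

seen_sample = rng.normal(loc=2.0, scale=0.5, size=dim)     # same distribution
unseen_sample = rng.normal(loc=-2.0, scale=0.5, size=dim)  # shifted distribution

w_seen = calibration_weight(seen_sample, mean, std)
w_unseen = calibration_weight(unseen_sample, mean, std)

frozen = np.ones(dim)    # stand-in for the frozen backbone output
residual = np.ones(dim)  # stand-in for the learned residual output

print("weight for seen-distribution sample:  ", w_seen)
print("weight for unseen-distribution sample:", w_unseen)
print("unseen output stays near frozen output:",
      np.allclose(integrate(frozen, residual, w_unseen), frozen))
```

Because the new knowledge enters only as an additive term, scaling it per sample is well defined. A weight near zero falls back to the pre-trained model for out-of-distribution inputs, which is what the residual property makes possible.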