Enhanced Continual Learning of Vision-Language Models with Model Fusion

Haoyuan Gao, Zicong Zhang, Yuqi Wei, Linglan Zhao, Guilin Li, Yexin Li, Linghe Kong, Weiran Huang

TL;DR

This work targets catastrophic forgetting in Vision-Language Models (VLMs) during sequential task fine-tuning. It introduces Continual Decoupling-Unifying (ConDU), which uses model fusion to maintain a unified base model $\theta^0$ and accumulated deltas $\boldsymbol{\delta}^{1:t}$, together with task triggers and prototype sets for efficient reconstruction of task-specific models. At training time, ConDU decouples and unifies delta models; at inference time, it reconstructs multiple task-specific models and aggregates their predictions via a semantic prototype mechanism, which also supports zero-shot inference. Across MTIL benchmarks, ConDU delivers up to about 2% higher average performance on seen tasks and improves zero-shot capability relative to the original VLM, without reference datasets or extensive hyperparameter tuning.

Abstract

Vision-Language Models (VLMs) represent a breakthrough in artificial intelligence by integrating visual and textual modalities to achieve impressive zero-shot capabilities. However, VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks. Existing continual learning methods for VLMs often rely heavily on additional reference datasets, compromise zero-shot performance, or are limited to parameter-efficient fine-tuning scenarios. In this paper, we propose Continual Decoupling-Unifying (ConDU), a novel approach that introduces model fusion into continual learning for VLMs. ConDU maintains a unified model along with task triggers and prototype sets, employing an iterative process of decoupling task-specific models for previous tasks and unifying them with the model for the newly learned task. Additionally, we introduce an inference strategy for zero-shot scenarios that aggregates predictions from multiple decoupled task-specific models. Extensive experiments across various settings show that ConDU achieves up to a 2% improvement in average performance across all seen tasks compared to state-of-the-art baselines, while also enhancing zero-shot capabilities relative to the original VLM.
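
The decouple-then-unify loop described above can be illustrated with a small sketch on flat parameter vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify the election rule or the trigger format, so the sketch assumes a sign election (the sign of the summed deltas wins, and the largest agreeing magnitude is kept) and a hypothetical trigger made of a binary mask plus a scalar chosen so that the reconstructed delta has the same $L_1$ norm as the original. The names `unify`, `decouple`, and `reconstruct` are made up for this example.

```python
# Hypothetical sketch of unifying and decoupling delta models.
# ASSUMPTIONS (not from the paper): sign-election rule, (mask, scale) triggers.

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def unify(deltas):
    """Elect a unified delta vector and compute one trigger per task."""
    unified = []
    for column in zip(*deltas):
        s = sign(sum(column))  # elected sign at this position
        # Keep the largest magnitude among entries that agree with the elected sign.
        agreeing = [abs(v) for v in column if s != 0 and sign(v) == s]
        unified.append(s * max(agreeing) if agreeing else 0.0)

    triggers = []
    for d in deltas:
        # Mask: positions where the task delta shares the unified sign.
        mask = [1 if v * u > 0 else 0 for v, u in zip(d, unified)]
        masked_l1 = sum(abs(u) for m, u in zip(mask, unified) if m)
        total_l1 = sum(abs(v) for v in d)
        # Scale: chosen so the reconstructed delta keeps the original L1 norm.
        scale = total_l1 / masked_l1 if masked_l1 > 0 else 0.0
        triggers.append((mask, scale))
    return unified, triggers

def decouple(unified, trigger):
    """Apply a task trigger to the unified delta to approximate that task's delta."""
    mask, scale = trigger
    return [scale * m * u for m, u in zip(mask, unified)]

def reconstruct(theta0, unified, trigger):
    """Task-specific parameters = pre-trained parameters + decoupled delta."""
    return [p + d for p, d in zip(theta0, decouple(unified, trigger))]

# Usage: two task deltas relative to the same pre-trained parameters.
theta0 = [1.0, 1.0, 1.0]
deltas = [[0.5, -0.2, 0.0], [0.25, 0.4, -0.1]]
unified, triggers = unify(deltas)
task0_params = reconstruct(theta0, unified, triggers[0])
```

Under these assumptions, only the unified delta and the lightweight triggers need to be stored between sessions; each earlier task's delta is approximated on demand, and positions that conflicted with the elected sign are zeroed out rather than flipped.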

Paper Structure

This paper contains 36 sections, 6 theorems, 5 equations, 4 figures, 7 tables.

Key Result

Lemma F.1

For a task $\delta^i$ in a continual learning session, the parameter at any position is guaranteed to preserve its sign. Specifically, if a position in $\delta^i$ (denoted as $a^i(1)$) is positive (or negative) after an iteration, then $\forall j, a^i(j) \geq 0$ (or $\leq 0$). Moreover, if $a^i_k =

Figures (4)

  • Figure 1: Overall framework of the proposed method. The colored points in the process denote modules of our method, including Tuning Individually, Decoupling Unified Model (see Figure 2b), Unifying Models (see Figure 2a), Aggregating Prediction (see Figure 3a), and Computing Prototypes (see Figure 3b).
  • Figure 2: The processes of Unifying Models (a) and Decoupling Unified Model (b) are recast as unifying delta models (a) and decoupling the unified delta model (b), respectively. A delta model represents the parameter offsets of a task-specific model relative to the pre-trained VLM. a) When unifying delta models, the unified delta model is obtained by an election process. Each task's task trigger is calculated from the difference between that task's delta model and the unified delta model. b) When decoupling the unified delta model, we apply task trigger $i$ to the unified delta model to reconstruct delta model $i$.
  • Figure 3: a) The process of Aggregating Prediction: we calculate the cosine similarity between the image embedding of the test sample and the prototypes of each category in the feature space of the pre-trained VLM, then take the maximum similarity within each task as the weight of the corresponding task-specific model. b) The process of Computing Prototypes: the prototype of each category is the mean of the image feature vectors plus the text feature vector for that category, all extracted by the original pre-trained VLM.
  • Figure 4: $t$-SNE Visualization of Feature Space.
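
The prototype and aggregation steps described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the paper's code. The caption is read here as "mean of the image features, then add the text feature"; how the per-task weights are combined with model outputs is not stated, so the sketch assumes a normalized weighted average of per-task probability vectors over a shared label set, with negative similarities clipped to zero. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def compute_prototype(image_feats, text_feat):
    """Category prototype: mean image feature plus the category's text feature.
    ASSUMPTION: 'plus' in the caption is read as vector addition."""
    n = len(image_feats)
    mean = [sum(col) / n for col in zip(*image_feats)]
    return [m + t for m, t in zip(mean, text_feat)]

def task_weights(query_feat, prototypes_per_task):
    """One weight per task: the maximum cosine similarity between the test
    embedding and any prototype belonging to that task."""
    return [max(cosine(query_feat, p) for p in protos)
            for protos in prototypes_per_task]

def aggregate(probs_per_task, weights):
    """Weighted average of per-task predictions.
    ASSUMPTION: weights are clipped at zero and normalized; falls back to a
    uniform average when every weight is non-positive."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        clipped = [1.0] * len(weights)
        total = float(len(weights))
    n_classes = len(probs_per_task[0])
    return [sum(w * p[k] for w, p in zip(clipped, probs_per_task)) / total
            for k in range(n_classes)]

# Usage: two tasks, features taken from the frozen pre-trained encoder.
prototypes = [[[1.0, 0.0], [0.0, 1.0]],   # task 0 has two categories
              [[0.0, 1.0]]]               # task 1 has one category
weights = task_weights([1.0, 0.0], prototypes)
prediction = aggregate([[0.9, 0.1], [0.2, 0.8]], weights)
```

Because the similarities are computed in the frozen pre-trained feature space, the weights do not depend on which task-specific model is being evaluated, which is what lets the same routine serve test samples from unseen tasks.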

Theorems & Definitions (11)

  • Lemma F.1: Sign Preservation of $\delta^i$
  • proof
  • Lemma F.2: Preservation of $L_1$ Norm
  • proof
  • Theorem F.3: Convergence of Iteration
  • proof
  • Corollary F.4
  • Theorem F.5
  • proof
  • Corollary F.6
  • ...and 1 more