Synthetic Data is an Elegant GIFT for Continual Vision-Language Models

Bin Wu, Wuxuan Shi, Jinqiao Wang, Mang Ye

TL;DR

This work addresses forgetting of pre-training knowledge during continual fine-tuning of Vision-Language Models by introducing GIFT, a framework that uses diffusion-generated synthetic image-text data to replay prior knowledge via contrastive distillation and image-text alignment. It combines this distillation with adaptive weight consolidation based on online Fisher information to maintain stability while learning new tasks. Empirical results on MTIL and CIL benchmarks show that GIFT consistently surpasses state-of-the-art methods using only a small synthetic data budget (e.g., 1K samples per task), highlighting the practicality of diffusion-based replay for VLM continual learning. Overall, GIFT demonstrates that synthetic data can effectively approximate pre-training distributions and support robust, data-efficient continual adaptation of VLMs.

Abstract

Pre-trained Vision-Language Models (VLMs) require Continual Learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge previously learned from downstream tasks, pre-training knowledge is also corrupted during continual fine-tuning. This issue is exacerbated by the unavailability of the original pre-training data, which degrades the VLM's generalization ability. In this paper, we propose GIFT, a novel continual fine-tuning approach that utilizes synthetic data to overcome catastrophic forgetting in VLMs. Taking advantage of recent advances in text-to-image synthesis, we employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. In this way, the VLM can revisit previous knowledge through distillation on matching diffusion-generated images and corresponding text prompts. Leveraging the broad distribution and high alignment between synthetic image-text pairs in the VLM's feature space, we propose a contrastive distillation loss along with an image-text alignment constraint. To further combat in-distribution overfitting and enhance distillation performance with a limited amount of generated data, we incorporate adaptive weight consolidation, utilizing Fisher information from these synthetic image-text pairs and achieving a better stability-plasticity balance. Extensive experiments demonstrate that our method consistently outperforms previous state-of-the-art approaches across various settings.
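
The contrastive distillation loss and the image-text alignment constraint named in the abstract can be illustrated with a small, dependency-free sketch. The abstract does not give the exact formulas, so everything below is an assumption: the function names, the temperature value, the symmetric (image-to-text plus text-to-image) form, and the use of the identity matrix as the hard alignment target are illustrative choices, not the authors' implementation.

```python
import math

def normalize(v):
    # L2-normalize a feature vector; fall back to 1.0 to avoid dividing by zero.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def similarity_logits(image_feats, text_feats, temperature=0.07):
    # Cosine similarity between every image and every text, scaled by a temperature.
    # The temperature value is an assumption, not taken from the paper.
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def kl_rows(teacher_logits, student_logits):
    # Mean over rows of KL(teacher || student) between row-wise softmax distributions.
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        lp, lq = log_softmax(t_row), log_softmax(s_row)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(teacher_logits)

def contrastive_distillation(teacher_logits, student_logits):
    # Hypothetical symmetric form: distill image-to-text and text-to-image distributions.
    return (kl_rows(teacher_logits, student_logits)
            + kl_rows(transpose(teacher_logits), transpose(student_logits)))

def alignment_loss(student_logits):
    # Cross-entropy against a hard target: the i-th image matches the i-th text.
    def diag_ce(logits):
        return -sum(log_softmax(row)[i] for i, row in enumerate(logits)) / len(logits)
    return diag_ce(student_logits) + diag_ce(transpose(student_logits))
```

In this sketch the teacher logits would come from the previous model and the student logits from the current model, both evaluated on the same batch of synthetic image-text pairs. The distillation term is zero when the two models agree, and the alignment term is small when each image is most similar to its own prompt.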

Paper Structure

This paper contains 13 sections, 10 equations, 11 figures, 15 tables.

Figures (11)

  • Figure 1: We use synthetic data generated by Stable Diffusion (Rombach et al., 2022) to support continual fine-tuning of VLMs. By creating prompts from learned downstream class names and diverse visual concepts (i.e., additional class names), the generated data effectively approximates both the downstream data and the VLM's pre-training data. During training, knowledge distillation (Hinton et al., 2015) enables the VLM to re-experience previous knowledge via its past responses on matching synthetic images and corresponding text prompts.
  • Figure 2: Framework overview of GIFT. (a) Synthetic Data-based Distillation aligns the output of the current CLIP model $\theta^t$ with that of the previous model $\theta^{t-1}$ on matching synthetic image-text pairs when learning a new task. An image-text alignment loss is applied to correct errors in the teacher model through a hard target, i.e., the alignment matrix. (b) Adaptive Weight Consolidation employs an $l_2$ penalty weighted by parameter importance to limit parameter changes that cause forgetting and overfitting. By leveraging the Fisher information $\mathcal{F}_{\theta^t}$ from synthetic image-text pairs during training, parameter importance is adjusted in real time to achieve a better stability-plasticity balance.
  • Figure 3: Generating different numbers of synthetic images as distillation data sources for each task produces different results.
  • Figure 4: CD loss for the first 5 tasks in MTIL order I. In our implementation, cross-entropy is used as an equivalent substitute for KL divergence when computing $\mathcal{L}_{CD}$, and the results are presented accordingly. Lower loss means better mitigation of forgetting.
  • Figure 5: Loss values on a two-dimensional slice of the loss landscape. We use $W_0$, $W_1$ and $W_2$ to represent the initial CLIP model and the CLIP models fine-tuned on the Aircraft dataset (Maji et al., 2013) without and with AWC, respectively. As in Garipov et al. (2018), we obtain an orthonormal basis $u_1$, $u_2$ for the plane spanned by these models, and the x- and y-axes show movement in parameter space along these two directions.
  • ...and 6 more figures
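
The Adaptive Weight Consolidation described in the Figure 2 caption, an $l_2$ penalty weighted by parameter importance derived from Fisher information, can be sketched as follows. The caption does not specify how the Fisher estimate is updated during training, so the exponential moving average of squared gradients used here is an assumption, as are the function names and the `decay` and `strength` parameters.

```python
def update_fisher(fisher, grads, decay=0.9):
    # Diagonal empirical Fisher estimate: running average of squared gradients.
    # The moving-average update rule is an assumption, not taken from the paper.
    return [decay * f + (1.0 - decay) * g * g for f, g in zip(fisher, grads)]

def consolidation_penalty(params, prev_params, fisher, strength=1.0):
    # Importance-weighted l2 distance between current and previous parameters.
    return strength * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, prev_params, fisher)
    )

def penalty_gradient(params, prev_params, fisher, strength=1.0):
    # Gradient of the penalty with respect to the current parameters.
    return [
        2.0 * strength * f * (p - q)
        for p, q, f in zip(params, prev_params, fisher)
    ]
```

Under these assumptions, `grads` would be the gradients of the distillation loss on a batch of synthetic image-text pairs. Parameters with a large Fisher value are pulled strongly toward their previous values, while parameters with a small Fisher value remain free to adapt to the new task.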