
Quantified Task Misalignment to Inform PEFT: An Exploration of Domain Generalization and Catastrophic Forgetting in CLIP

Laura Niss, Kevin Vogt-Lowell, Theodoros Tsiligkaridis

TL;DR

The paper addresses how task difficulty and embedding misalignment influence fine-tuning of CLIP, evaluated through domain generalization (DG) and catastrophic forgetting (CF). It proposes the silhouette score $ss$ as a better proxy for task difficulty than the average cosine similarity $ACS$ and systematically compares five fine-tuning methods (full fine-tuning, CLIP-Adapter, LoRA, BitFit, and A-CLIP) across DG and CF evaluation setups. Key findings: $ss$ tracks task difficulty and informs performance trends; LoRA minimizes CF but may underperform on DG; BitFit suffers CF despite strong in-domain (ID) results; CLIP-Adapter and full fine-tuning improve DG at the cost of CF; and A-CLIP offers the best balance between DG gains and CF containment. The work provides practical guidance for selecting PEFT strategies in multimodal fine-tuning to balance generalization with forgetting, with implications for deploying CLIP-like models in varied domains.

Abstract

Foundation models are presented as generalists that often perform well over a myriad of tasks. Fine-tuning these models, even on limited data, provides an additional boost in task-specific performance but often at the cost of their wider generalization, an effect termed catastrophic forgetting. In this paper, we analyze the relation between task difficulty in the CLIP model and the performance of several simple parameter-efficient fine-tuning methods through the lens of domain generalization and catastrophic forgetting. We provide evidence that the silhouette score of the zero-shot image and text embeddings is a better measure of task difficulty than the average cosine similarity of correct image/label embeddings, and discuss observable relationships between task difficulty, fine-tuning method, domain generalization, and catastrophic forgetting. Additionally, the averaged results across tasks and performance measures demonstrate that a simplified method that trains only a subset of attention weights, which we call A-CLIP, yields a balance between domain generalization and catastrophic forgetting.
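
The two alignment measures compared in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the silhouette score is computed over two clusters (all image embeddings versus all text embeddings, which is how the Figure 5 caption reads) and that cosine distance is the metric. The function names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_cosine_similarity(image_embs, text_embs, labels):
    """Mean cosine similarity between each image and its correct label embedding."""
    sims = [cosine(img, text_embs[y]) for img, y in zip(image_embs, labels)]
    return sum(sims) / len(sims)

def silhouette_score(points, cluster_ids):
    """Mean silhouette coefficient under cosine distance (1 - cosine similarity).

    Requires at least two distinct cluster ids.
    """
    clusters = sorted(set(cluster_ids))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to each cluster, leaving p itself out.
        mean_dist = {}
        for c in clusters:
            dists = [1.0 - cosine(p, q)
                     for j, q in enumerate(points)
                     if cluster_ids[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = cluster_ids[i]
        if own not in mean_dist:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        a = mean_dist[own]                                       # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)     # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 2-D "embeddings" (made up for illustration only).
image_embs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]  # one embedding per class label
labels = [0, 0, 1, 1]

acs = average_cosine_similarity(image_embs, text_embs, labels)

# Assumption: the two clusters are the modalities (image vs. text).
points = image_embs + text_embs
modality = ["image"] * len(image_embs) + ["text"] * len(text_embs)
ss = silhouette_score(points, modality)
print(f"ACS = {acs:.3f}, ss = {ss:.3f}")
```

Under this reading, a high silhouette score means the image and text embeddings form well-separated clusters, whereas ACS only looks at matched image/label pairs and ignores how the remaining embeddings are arranged.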

Paper Structure (11 sections, 3 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: In-domain and cross-dataset accuracy by model and training data.
  • Figure 2: Silhouette score and average cosine similarity of zero-shot image and text embeddings.
  • Figure 3: The top left figure compares alignment measure to accuracy and contains all training methods and test data. The bottom plots show the Pearson correlation with a 95% confidence interval, with the left plot using the log of the measures because their relationship to accuracy appears exponential. The top right figure contains only ID test data and excludes results from 16-shot ImageNet and LoRA.
  • Figure 4: Average difference of layer subset magnitude with 95% confidence intervals.
  • Figure 5: Distributions of the difference in measure from the ZS to FT model for in-domain, DG, and CF. Here, 'ss' is the silhouette score of the ZS text and image embeddings minus the silhouette score of the FT text and image embeddings. A positive value thus means the clusters of image and text embeddings moved closer together. Similarly, 'cosine' is the cosine similarity score of the FT text and image embeddings minus the cosine similarity of the ZS text and image embeddings. Again, a positive value means the image and appropriate label embeddings moved closer to each other.
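
The sign conventions in the Figure 5 caption and the correlation analysis in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code. The captions do not say how the 95% confidence interval was computed, so the Fisher z-transform used here is an assumption. The function names and all numbers are made up.

```python
import math

def measure_shifts(ss_zs, ss_ft, cos_zs, cos_ft):
    """Differences as defined in the Figure 5 caption.

    'ss' is zero-shot minus fine-tuned; 'cosine' is fine-tuned minus zero-shot.
    With these signs, a positive value in either entry means the image and
    text embeddings moved closer together after fine-tuning.
    """
    return {"ss": ss_zs - ss_ft, "cosine": cos_ft - cos_zs}

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for r via the Fisher z-transform (assumed method).

    Needs n > 3 and |r| < 1.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical values for one task, before (ZS) and after (FT) fine-tuning.
shifts = measure_shifts(ss_zs=0.30, ss_ft=0.10, cos_zs=0.25, cos_ft=0.40)

# Hypothetical (alignment measure, accuracy) pairs across tasks.
measure = [0.05, 0.10, 0.20, 0.35, 0.50]
accuracy = [0.40, 0.55, 0.62, 0.80, 0.85]

r = pearson_r(measure, accuracy)
low, high = fisher_ci(r, len(measure))
print(shifts)
print(f"r = {r:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only a handful of points the interval is wide, which is why a confidence interval is more informative here than the correlation coefficient alone.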