Quantified Task Misalignment to Inform PEFT: An Exploration of Domain Generalization and Catastrophic Forgetting in CLIP
Laura Niss, Kevin Vogt-Lowell, Theodoros Tsiligkaridis
TL;DR
The paper examines how task difficulty and embedding misalignment affect fine-tuning of CLIP, viewed through domain generalization (DG) and catastrophic forgetting (CF). It proposes the silhouette score $ss$ as a better proxy for task difficulty than the average cosine similarity $ACS$, and systematically compares five fine-tuning methods (full fine-tuning and the PEFT methods CLIP-Adapter, LoRA, BitFit, and A-CLIP) across DG and CF evaluation setups. Key findings: $ss$ tracks task difficulty and helps explain performance trends; LoRA minimizes CF but may underperform on DG; BitFit suffers from CF despite strong in-distribution (ID) results; CLIP-Adapter and full fine-tuning improve DG at the cost of CF; and A-CLIP offers the best balance between DG gains and CF containment. The work gives practical guidance for choosing a PEFT strategy in multimodal fine-tuning that balances generalization against forgetting, with implications for deploying CLIP-like models across varied domains.
Abstract
Foundation models are presented as generalists that often perform well over a myriad of tasks. Fine-tuning these models, even on limited data, provides an additional boost in task-specific performance but often at the cost of their wider generalization, an effect termed catastrophic forgetting. In this paper, we analyze the relation between task difficulty in the CLIP model and the performance of several simple parameter-efficient fine-tuning methods through the lens of domain generalization and catastrophic forgetting. We provide evidence that the silhouette score of the zero-shot image and text embeddings is a better measure of task difficulty than the average cosine similarity of correct image/label embeddings, and discuss observable relationships between task difficulty, fine-tuning method, domain generalization, and catastrophic forgetting. Additionally, the averaged results across tasks and performance measures demonstrate that a simplified method that trains only a subset of attention weights, which we call A-CLIP, yields a balance between domain generalization and catastrophic forgetting.
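To make the two difficulty measures concrete, the sketch below computes both on toy 2-D vectors standing in for zero-shot CLIP embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out how clusters are formed, so this sketch assumes that each class forms one cluster containing its image embeddings plus its text (label) embedding, and that cosine distance is used. All function names and the toy data are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_cosine_similarity(image_embs, labels, text_embs):
    """Mean similarity between each image and the text embedding of its true label."""
    sims = [cosine_similarity(img, text_embs[y]) for img, y in zip(image_embs, labels)]
    return sum(sims) / len(sims)

def silhouette_score(points, labels):
    """Mean silhouette coefficient under cosine distance (1 - cosine similarity)."""
    n = len(points)
    scores = []
    for i in range(n):
        # Group distances from point i by cluster label, excluding i itself.
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = 1.0 - cosine_similarity(points[i], points[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)                                # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n

def task_silhouette(image_embs, labels, text_embs):
    """ASSUMPTION: the text embedding of class c is added to cluster c."""
    points = list(image_embs) + list(text_embs)
    cluster_ids = list(labels) + list(range(len(text_embs)))
    return silhouette_score(points, cluster_ids)

# Toy data: two classes with text embeddings on the axes (hypothetical values).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]
easy_images = [[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]]  # classes well separated
hard_images = [[1.0, 0.8], [1.0, 0.9], [0.9, 1.0], [0.8, 1.0]]  # classes nearly overlap

ss_easy = task_silhouette(easy_images, labels, text_embs)
ss_hard = task_silhouette(hard_images, labels, text_embs)
acs_easy = average_cosine_similarity(easy_images, labels, text_embs)
acs_hard = average_cosine_similarity(hard_images, labels, text_embs)
print(f"ss: easy={ss_easy:.3f} hard={ss_hard:.3f}")
print(f"ACS: easy={acs_easy:.3f} hard={acs_hard:.3f}")
```

In this toy setup both measures rank the well-separated task as easier. The practical difference is that the silhouette score also accounts for how close each embedding sits to the other classes, whereas the average cosine similarity only looks at the correct image/label pairs.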
