DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models
Eman Ali, Sathira Silva, Muhammad Haris Khan
TL;DR
The paper tackles unsupervised domain adaptation of CLIP-style vision-language models to unlabeled target domains, where misalignment between visual and textual representations degrades pseudo-label quality. It introduces DPA, which maintains dual prototypes (image prototypes $\mathbf{P}$ and textual prototypes $\mathbf{Z}$) and generates pseudo-labels from a convex combination of their predictions. Supporting components are a memory bank used to update $\mathbf{P}$, pseudo-label ranking, and a training objective that combines self-training ($\mathcal{L}_{st}$), regularization ($\mathcal{L}_{reg}$), and prototype alignment ($\mathcal{L}_{align}$); an additional InfoNCE-based term aligns $\mathbf{P}$ and $\mathbf{Z}$. Textual prototypes are initialized by averaging the embeddings of $k$ prompts per class, $\mathbf{Z}_j = \frac{1}{k} \sum_{i=1}^k z_{ji}$, where $z_{ji}$ is the embedding of the $i$-th prompt for class $j$. During inference, only the textual prototypes are used. Empirically, DPA improves consistently over zero-shot CLIP and strong unsupervised baselines across 13 datasets without any labeled target data. The approach scales well because the image prototypes are non-parametric and only a limited set of components is fine-tuned.
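To make the pseudo-labeling step concrete, the following is a minimal NumPy sketch of prompt-averaged textual prototypes and the convex fusion of the two prototype classifiers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fusion weight `alpha`, the `temperature` value, and the use of cosine similarity followed by a softmax are all hypothetical choices that this summary does not specify.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def textual_prototypes(prompt_embeddings):
    """Average k prompt embeddings per class: Z_j = (1/k) * sum_i z_ji.

    prompt_embeddings has shape (num_classes, k, dim).
    """
    return prompt_embeddings.mean(axis=1)


def fused_pseudo_labels(features, P, Z, alpha=0.5, temperature=0.01):
    """Pseudo-labels from a convex combination of two prototype classifiers.

    features: (n, dim) image features; P, Z: (num_classes, dim) prototypes.
    alpha and temperature are assumed hyperparameters, not values from the paper.
    """
    f = l2_normalize(features)
    p_img = softmax(f @ l2_normalize(P).T / temperature)  # image-prototype classifier
    p_txt = softmax(f @ l2_normalize(Z).T / temperature)  # textual-prototype classifier
    fused = alpha * p_img + (1.0 - alpha) * p_txt          # convex fusion
    return fused.argmax(axis=1), fused
```

With `alpha=1.0` the pseudo-labels come from the image prototypes alone, and with `alpha=0.0` from the textual prototypes alone; intermediate values blend the two predictions.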
Abstract
Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labeled data is unavailable. Recent research has proposed pseudo-labeling approaches to adapt CLIP in an unsupervised manner using unlabeled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.
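The prototype alignment step can be illustrated with a generic InfoNCE-style loss in which each textual prototype treats the image prototype of the same class as its positive and the image prototypes of all other classes as negatives. This is a sketch under assumptions rather than the paper's exact loss: the function name, the text-to-image direction, and the `temperature` value are hypothetical.

```python
import numpy as np


def info_nce_alignment(P, Z, temperature=0.07):
    """Generic InfoNCE loss pulling each textual prototype Z_j toward image prototype P_j.

    P, Z: (num_classes, dim) arrays. The temperature is an assumed value.
    """
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    logits = Z @ P.T / temperature                        # (C, C) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: class j's text prototype vs. class j's image prototype.
    return float(-np.mean(np.diag(log_probs)))
```

The loss is small when matching prototypes are the most similar pair in each row and grows when a textual prototype lies closer to another class's image prototype.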
