
DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models

Eman Ali, Sathira Silva, Muhammad Haris Khan

TL;DR

The paper tackles unsupervised adaptation of CLIP-style vision-language models to unlabeled target domains, where cross-modal misalignment degrades pseudo-label quality. It introduces DPA, which uses dual prototypes (image prototypes $\mathbf{P}$ and textual prototypes $\mathbf{Z}$) and a convex fusion of their predictions to generate pseudo-labels. The method is supported by a memory bank that updates $\mathbf{P}$, by pseudo-label ranking, and by a training objective combining self-training ($\mathcal{L}_{st}$), regularization ($\mathcal{L}_{reg}$), and prototype alignment ($\mathcal{L}_{align}$); an additional InfoNCE-based term aligns $\mathbf{P}$ and $\mathbf{Z}$. Textual prototypes are initialized from multiple prompts, so that $\mathbf{Z}_j = \frac{1}{k} \sum_{i=1}^k z_{ji}$. During inference, only the textual prototypes are used. Empirically, DPA improves consistently over zero-shot CLIP and strong unsupervised baselines across 13 datasets, without any labeled target data. The approach scales well because the image prototypes are non-parametric and only a limited set of components is fine-tuned.
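The prompt averaging and convex fusion described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax `temperature`, the L2 normalization before the similarity computation, and the choice that `beta` weights the image-prototype branch are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def init_textual_prototypes(prompt_embeddings):
    # prompt_embeddings: (C, k, d), k prompt embeddings z_ji for each class j.
    # Z_j = (1/k) * sum_i z_ji, as stated in the summary.
    return prompt_embeddings.mean(axis=1)

def fused_pseudo_labels(features, image_protos, text_protos,
                        beta=0.5, temperature=0.01):
    # features: (N, d); image_protos and text_protos: (C, d).
    # Assumption: beta weights the image branch, (1 - beta) the text branch.
    f = l2_normalize(features)
    p_img = softmax(f @ l2_normalize(image_protos).T / temperature)
    p_txt = softmax(f @ l2_normalize(text_protos).T / temperature)
    fused = beta * p_img + (1.0 - beta) * p_txt  # convex combination
    return fused.argmax(axis=1), fused
```

Because both branches output probability vectors and the weights sum to one, each fused row is itself a valid distribution. Setting `beta=0.0` reduces the sketch to prediction from the textual prototypes alone, which is the regime the summary describes for inference.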

Abstract

Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labeled data is unavailable. Recent research has proposed pseudo-labeling approaches to adapt CLIP in an unsupervised manner using unlabeled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.
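One common way to align two sets of class prototypes is an InfoNCE-style contrastive loss, in which each textual prototype should be most similar to the image prototype of the same class. The sketch below shows that generic formulation in NumPy. The function name, the temperature value, and the text-to-image direction of the loss are assumptions; the exact form of the alignment term used by DPA is not specified in this summary.

```python
import numpy as np

def info_nce_alignment(image_protos, text_protos, temperature=0.07):
    # image_protos, text_protos: (C, d), one prototype per class in each modality.
    P = image_protos / np.linalg.norm(image_protos, axis=1, keepdims=True)
    Z = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    # Row j holds similarities between textual prototype j and all image prototypes.
    logits = Z @ P.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for class j is the diagonal entry (j, j).
    return -np.mean(np.diag(log_prob))
```

The loss is small when every textual prototype sits closest to its own class's image prototype and grows when the two sets are mismatched, so minimizing it with respect to the textual prototypes would pull them toward the structure of the visual feature space.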

Paper Structure (7 sections, 11 equations, 6 figures, 9 tables)


Figures (6)

  • Figure 1: Comparison of t-SNE projections for zero-shot CLIP, ReCLIP, and DPA visual embeddings, along with their corresponding visual (circle $\newmoon$) and textual (star $\bigstar$) prototypes on the EuroSAT dataset. The visual prototypes are computed as the mean of each cluster. For clarity, class-agnostic and redundant features are removed from all embeddings using a fixed projection prior to applying the t-SNE projection, following ReCLIP. The cosine similarities between visual and textual prototypes and the well-separated visual clusters, as illustrated in (c), demonstrate the superior performance of our method in capturing both inter-modal and intra-modal alignment compared to the existing approaches depicted in (a) and (b).
  • Figure 2: The overall framework of DPA. (a) Given a target dataset, DPA uses a set of carefully designed prompts to initialize the textual prototypes with CLIP's zero-shot text encoder $E_t$. (b) To achieve effective self-training, DPA introduces dual prototypes, namely image and textual prototypes, that behave like two distinct classifiers, and it fuses their outputs via a convex combination to form accurate pseudo-labels (PLs). It also ranks PLs in the classification loss to reduce the impact of noisy PLs during early self-training. Finally, DPA aligns the textual prototypes with the visual prototypes so that they adapt to the target feature semantic relations learned by the visual encoder. (c) During inference, DPA discards the visual prototypes and relies solely on the textual prototypes for prediction.
  • Figure 3: Sensitivity analysis of the hyperparameters $\lambda_2$, $\lambda_3$, and $\beta$ on the accuracy ($\%$) of DPA. We set $\lambda_1=1$ across all datasets (here ESAT is used) since $\mathcal{L}_{st}$ is our primary loss.
  • Figure 4: Base vs. DPA in (a) PL, (b) training, and (c) testing accuracy on the EuroSAT dataset.
  • Figure 5: Top-1 accuracy ($\%$) for the ViT-B/32 backbone on the CIFAR-10 dataset using 20$\%$, 40$\%$, 60$\%$, 80$\%$, and 100$\%$ of the training data for DPA.
  • ...and 1 more figures