UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models

Jiachen Liang, Ruibing Hou, Minyang Hu, Hong Chang, Shiguang Shan, Xilin Chen

TL;DR

This work proposes Unsupervised Multi-domain Feature Calibration (UMFC), a training-free and label-free feature calibration method. UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition, and it performs on par with state-of-the-art methods that need additional annotations or optimization.

Abstract

Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. However, they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which can be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability of vision-language models. Under this unsupervised multi-domain setting, we identify inherent model bias within CLIP, notably in its visual and text encoders. Specifically, we observe that CLIP's visual encoder tends to prioritize encoding domain information over discriminative category information, while its text encoder exhibits a preference for domain-relevant classes. To mitigate this model bias, we propose a training-free and label-free feature calibration method, Unsupervised Multi-domain Feature Calibration (UMFC). UMFC estimates image-level biases from domain-specific features and text-level biases from the direction of domain transition. These biases are then subtracted from the original image and text features, respectively, to render them domain-invariant. We evaluate our method in multiple settings, including transductive learning and test-time adaptation. Extensive experiments show that our method outperforms CLIP and performs on par with state-of-the-art methods that need additional annotations or optimization. Our code is available at https://github.com/GIT-LJc/UMFC.
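The subtraction step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that each unlabeled image already has a domain or cluster index (`domain_ids`, hypothetical), that the image-level bias of a domain is the mean feature of that domain, and that the text-level bias can be approximated by the offset of a domain's mean image feature from the mean over all domains. The function names are illustrative only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as is usual for CLIP-style features."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def estimate_domain_biases(image_feats, domain_ids, num_domains):
    """Assumed image-level bias: the mean feature of each domain.

    Every domain index in range(num_domains) must occur at least once.
    """
    return np.stack(
        [image_feats[domain_ids == d].mean(axis=0) for d in range(num_domains)]
    )


def calibrate_image_features(image_feats, domain_ids, biases):
    """Subtract each image's domain bias, then renormalize."""
    return l2_normalize(image_feats - biases[domain_ids])


def calibrate_text_features(text_feats, biases, domain):
    """Assumed text-level bias: offset of one domain's mean image feature
    from the mean over all domains, used as a proxy for the direction of
    domain transition."""
    direction = biases[domain] - biases.mean(axis=0)
    return l2_normalize(text_feats - direction)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for encoder outputs: 40 images, 5 classes, dim 8.
    image_feats = l2_normalize(rng.normal(size=(40, 8)))
    text_feats = l2_normalize(rng.normal(size=(5, 8)))
    domain_ids = np.repeat(np.arange(2), 20)  # hypothetical domain assignment

    biases = estimate_domain_biases(image_feats, domain_ids, num_domains=2)
    images_cal = calibrate_image_features(image_feats, domain_ids, biases)
    texts_cal = calibrate_text_features(text_feats, biases, domain=0)

    # Zero-shot style prediction for domain 0 with calibrated features.
    logits = images_cal[domain_ids == 0] @ texts_cal.T
    print(logits.argmax(axis=1))
```

Both steps only compute means and subtract them, so no gradient updates or labels are involved, which matches the training-free and label-free description. How domains are identified in unlabeled data is left open here.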

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, 13 tables, 2 algorithms.

Figures (4)

  • Figure 1: On the DomainNet dataset, we visualize (a) the accuracy of CLIP on the six domains; (b) the image features extracted by CLIP's image encoder across different domains, which show that CLIP exhibits inherent model bias; and (c) the number of predictions for different classes on the quickdraw and painting domains.
  • Figure 2: On the DomainNet dataset, we visualize (a) the image features extracted by the UMFC image encoder across different domains and (b) the classification probabilities of CLIP's text features on different domains.
  • Figure 3: Visualization of Image Features based on OpenCLIP series.
  • Figure 4: The domain transition direction between texts is similar to that between images.