Training-Free Unsupervised Prompt for Vision-Language Models

Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, Shengsheng Wang, Jingdong Wang

TL;DR

This work tackles the challenge of adapting vision-language models without labeled data by proposing Training-Free Unsupervised Prompt (TFUP), which preserves the pre-trained representations while leveraging similarity-based predictions through a Feature Cache Model (FCM) and a Multi-level Similarity Measure (MSM). TFUP constructs a cache of representative samples via confidence and prototype filtering and performs training-free inference by combining feature-level and semantic-level similarities to produce similarity-based probabilities. A training-based extension, TFUP-T, adds parameter-efficient adapters with residual connections and introduces both pseudo-label cross-entropy loss and a global marginal distribution entropy loss to further boost performance. Across domain adaptation benchmarks (Domain-Net, Office-Home, Office-31, VisDA-2017), TFUP achieves state-of-the-art results among unsupervised methods and surpasses several few-shot approaches, demonstrating the practical impact of training-free adaptation for VLMs.

Abstract

Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recent unsupervised prompt tuning methods, such as UPL and POUF, directly use pseudo-labels as supervision to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and degrade the learned representations. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to build a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that uses both feature-level and semantic-level similarities to compute the distance between each test image and each cached sample; this distance weights the corresponding cached label when generating similarity-based prediction probabilities. In this way, TFUP achieves surprisingly strong performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we propose a training-based approach (TFUP-T) to further boost adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts a marginal distribution entropy loss to constrain the model from a global perspective. TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves on the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
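
To make the training-free pipeline in the abstract more concrete, the NumPy sketch below builds a small feature cache from pseudo-labeled samples and then adds cache-based scores to the zero-shot logits through a residual connection. It is only an illustration under stated assumptions: the selection score (confidence plus prototype similarity), the exponential feature weighting, the cosine similarity between zero-shot probability vectors used as the semantic-level term, and the names `build_cache`, `tfup_like_predict`, `residual_weight`, and `sharpness` are all hypothetical and may differ from the paper's actual formulation and hyperparameters.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def build_cache(img_feats, text_feats, k=2, temp=100.0):
    """Pick up to k representative samples per pseudo-class.

    Assumes L2-normalized features and at least one sample per cached class.
    """
    probs = softmax(temp * img_feats @ text_feats.T)  # zero-shot probabilities
    pseudo = probs.argmax(axis=1)                     # pseudo-labels
    conf = probs.max(axis=1)                          # instance confidence
    n_classes = text_feats.shape[0]
    keys, labels = [], []
    for c in range(n_classes):
        idx = np.where(pseudo == c)[0]
        if idx.size == 0:
            continue  # no sample was assigned to this class
        # Prototype score: similarity to the mean feature of the pseudo-class.
        proto = l2_normalize(img_feats[idx].mean(axis=0))
        proto_score = img_feats[idx] @ proto
        score = conf[idx] + proto_score               # assumed combination rule
        top = idx[np.argsort(-score)[:k]]
        keys.append(img_feats[top])
        labels.append(np.full(top.size, c))
    keys = np.concatenate(keys)
    onehot = np.eye(n_classes)[np.concatenate(labels)]
    return keys, onehot

def tfup_like_predict(test_feats, text_feats, keys, onehot,
                      residual_weight=1.0, sharpness=5.0, temp=100.0):
    """Zero-shot logits plus similarity-weighted cached labels."""
    zero_shot = temp * test_feats @ text_feats.T
    # Feature-level similarity between test images and cached samples.
    feat_sim = test_feats @ keys.T
    # Semantic-level similarity (assumed): cosine between zero-shot probabilities.
    p_test = softmax(zero_shot)
    p_keys = softmax(temp * keys @ text_feats.T)
    sem_sim = l2_normalize(p_test) @ l2_normalize(p_keys).T
    # Combine both levels into one weight per (test image, cached sample) pair.
    weights = np.exp(-sharpness * (1.0 - feat_sim)) * sem_sim
    cache_logits = weights @ onehot
    # Residual connection: the pre-trained prediction is kept and only adjusted.
    return zero_shot + residual_weight * cache_logits
```

Because the cache term is added to the zero-shot logits rather than replacing them, setting `residual_weight` to zero recovers plain zero-shot inference, which is one way to read the claim that the pre-trained representation is preserved.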

Paper Structure (29 sections, 11 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: (a) Zero-shot inference of the pre-trained CLIP. (b) Existing unsupervised prompt tuning methods, such as UPL [huang2022unsupervised] and POUF [tanwisuth2023pouf], which fine-tune models or prompts directly on unlabeled data. (c) Our training-free unsupervised prompt (TFUP) method, which generates similarity-based prediction probabilities using the proposed Feature Cache Model (FCM) and Multi-level Similarity Measure (MSM).
  • Figure 2: Performance comparison of CLIP [radford2021learning], POUF [tanwisuth2023pouf], TFUP, TFUP-T, and KgCoOp [yao2023visual] on the Domain-Net and Office-Home datasets in terms of top-1 classification accuracy.
  • Figure 3: Overview of the TFUP framework. TFUP creates a Feature Cache Model (FCM) from the unlabeled training set using confidence and prototype filters. Based on the cache model, we propose a Multi-level Similarity Measure (MSM), consisting of a Feature Similarity Measure (FSM) and a Semantic Similarity Measure (SSM), to calculate the distance between each test image and each cached sample. These distances serve as the weights of the corresponding cached labels when generating similarity-based prediction probabilities.
  • Figure 4: Framework of our proposed unsupervised prompt tuning method (TFUP-T). TFUP-T augments the CLIP model with a two-layer multi-layer perceptron (MLP) adapter, which is optimized with the cross-entropy loss and the marginal distribution entropy loss.
  • Figure 5: Sensitivity analysis of $\alpha$ and $\beta$ on Office-Home.
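
The TFUP-T components named in the Figure 4 caption (a two-layer MLP adapter, a pseudo-label cross-entropy loss, and a marginal distribution entropy loss) can be sketched in NumPy as follows. This is a hedged illustration, not the authors' implementation: the residual mixing `ratio`, the ReLU activation, the loss `weight`, and in particular the sign convention (here the marginal entropy is maximized so that predictions do not collapse onto a few classes) are assumptions, and all function names are hypothetical.

```python
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def adapter_forward(feats, w1, w2, ratio=0.2):
    """Two-layer MLP adapter with a residual connection (assumed form)."""
    hidden = np.maximum(feats @ w1, 0.0)   # ReLU is an assumption
    return feats + ratio * (hidden @ w2)

def pseudo_label_cross_entropy(logits, pseudo_labels):
    """Mean cross-entropy against hard pseudo-labels."""
    log_probs = log_softmax(logits)
    rows = np.arange(logits.shape[0])
    return -log_probs[rows, pseudo_labels].mean()

def marginal_entropy(logits, eps=1e-12):
    """Entropy of the batch-averaged (marginal) class distribution."""
    probs = np.exp(log_softmax(logits))
    marginal = probs.mean(axis=0)
    return -(marginal * np.log(marginal + eps)).sum()

def tfup_t_like_loss(logits, pseudo_labels, weight=1.0):
    # Assumed sign: subtracting the marginal entropy rewards balanced
    # predictions across the whole batch (the "global" constraint).
    ce = pseudo_label_cross_entropy(logits, pseudo_labels)
    return ce - weight * marginal_entropy(logits)
```

The cross-entropy term acts on each sample individually, whereas the marginal entropy depends only on the batch-averaged prediction, which is one way to interpret "constraining the model from a global perspective".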