Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data

Jiahan Zhang; Qi Wei; Feng Liu; Lei Feng

TL;DR

The paper tackles the challenge of adapting vision‑language models with abundant unlabeled data when zero‑shot performance is insufficient for reliable hard pseudolabeling. It proposes Candidate Pseudolabel Learning (CPL), a framework that builds candidate pseudolabel sets from a confidence matrix via intra‑ and inter‑instance strategies and then learns with partial‑label losses in an iterative loop. The method reframes downstream learning as partial‑label learning, enabling the use of standard losses and promoting balanced, high‑quality label coverage. Across nine datasets and three unlabeled‑data paradigms, CPL consistently outperforms hard pseudolabel baselines and shows robustness to varying zero‑shot abilities, suggesting strong practical benefits for data‑efficient fine‑tuning of vision‑language models.
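The candidate-generation step described above can be sketched roughly as follows. The summary does not spell out the selection rules, so everything specific here is an assumption made for illustration: the intra-instance rule keeps the top-ranked classes of each instance until their cumulative confidence reaches a threshold `tau`, the inter-instance rule keeps, for each class, the `keep_frac` fraction of instances that are most confident in that class, and the two selections are combined by intersection. The function names `intra_instance`, `inter_instance`, and `candidate_sets` are hypothetical and do not come from the authors' code.

```python
import math

def intra_instance(conf, tau):
    """Per instance: top-ranked classes until cumulative confidence reaches tau (assumed rule)."""
    masks = []
    for row in conf:
        order = sorted(range(len(row)), key=lambda j: -row[j])
        mask, total = [False] * len(row), 0.0
        for j in order:
            if total >= tau:
                break
            mask[j] = True
            total += row[j]
        masks.append(mask)
    return masks

def inter_instance(conf, keep_frac):
    """Per class: the keep_frac fraction of instances most confident in that class (assumed rule)."""
    n, c = len(conf), len(conf[0])
    k = max(1, math.ceil(keep_frac * n))
    masks = [[False] * c for _ in range(n)]
    for j in range(c):
        top = sorted(range(n), key=lambda i: -conf[i][j])[:k]
        for i in top:
            masks[i][j] = True
    return masks

def candidate_sets(conf, tau=0.9, keep_frac=0.5):
    """Combine both selections by intersection (an assumption of this sketch)."""
    intra = intra_instance(conf, tau)
    inter = inter_instance(conf, keep_frac)
    return [[a and b for a, b in zip(ra, rb)] for ra, rb in zip(intra, inter)]

# Toy confidence matrix: 3 unlabeled instances, 3 classes, rows sum to 1.
conf = [
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.10, 0.80],
]
masks = candidate_sets(conf, tau=0.6, keep_frac=0.5)
# masks == [[True, False, False], [True, True, False], [False, False, True]]
```

In this toy run the ambiguous second instance keeps two candidate labels while the confident instances keep one each, which is the behavior a candidate set is meant to provide when a single hard pseudolabel would be unreliable. An instance whose intersection comes out empty would simply contribute no training target in this sketch.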

Abstract

Fine-tuning vision-language models (VLMs) with abundant unlabeled data has recently attracted increasing attention. Existing methods that rely on pseudolabeling suffer from largely incorrect hard pseudolabels when VLMs exhibit low zero-shot performance on downstream tasks. To alleviate this issue, we propose a Candidate Pseudolabel Learning method, termed CPL, to fine-tune VLMs with suitable candidate pseudolabels of unlabeled data in downstream tasks. The core of our method lies in the generation strategy of candidate pseudolabels, which progressively generates refined candidate pseudolabels by both intra- and inter-instance label selection, based on a confidence score matrix for all unlabeled data. This strategy improves both true-label inclusion and class-balanced instance selection. As a result, we can directly apply existing loss functions to learn with the generated candidate pseudolabels. Extensive experiments on nine benchmark datasets with three learning paradigms demonstrate the effectiveness of our method. Our code can be found at https://github.com/vanillaer/CPL-ICML2024.
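To illustrate what learning with candidate pseudolabels can look like, here is a minimal sketch of one common partial-label objective: the negative log of the total softmax probability assigned to the candidate set. This is not claimed to be the loss used in the paper, which only states that existing loss functions can be applied. The function name `candidate_set_loss` and the choice to skip instances with empty candidate sets are assumptions of this sketch.

```python
import math

def candidate_set_loss(logits, masks):
    """Mean of -log(sum of softmax probabilities over each instance's candidate set).

    logits: list of per-instance class scores.
    masks:  list of per-instance boolean lists marking candidate labels.
    Instances with an empty candidate set are skipped (assumed convention).
    """
    losses = []
    for z, m in zip(logits, masks):
        if not any(m):
            continue
        top = max(z)  # subtract the max score for numerical stability
        exps = [math.exp(v - top) for v in z]
        total = sum(exps)
        cand = sum(e for e, keep in zip(exps, m) if keep)
        losses.append(math.log(total) - math.log(cand))
    return sum(losses) / len(losses) if losses else 0.0

# With uniform scores, a single-label candidate set costs log(3),
# while a candidate set covering every class costs 0.
loss = candidate_set_loss(
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[True, False, False], [True, True, True]],
)
```

When every candidate set contains exactly one label, this objective reduces to ordinary cross-entropy on hard pseudolabels, so the candidate-set formulation can be read as a relaxation of hard pseudolabeling.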

Paper Structure (22 sections, 6 equations, 11 figures, 10 tables, 1 algorithm)

This paper contains 22 sections, 6 equations, 11 figures, 10 tables, 1 algorithm.

Introduction
Related Work
Vision-Language Models
Prompt Tuning
Learning from Unlabeled Data
Methodology
Scheme for Generating Candidate Pseudolabels
Learning with Candidate Pseudolabels
Experiments
Experimental Setting
Comparison with Previous Methods
More Analyses
Ablation Studies
Limitations
Conclusion
...and 7 more sections

Figures (11)

Figure 1: (a) Confusion matrix between true labels and hard pseudolabels on the EuroSAT dataset, where incorrect and imbalanced pseudolabels are always generated. (b) An example of a set of candidate pseudolabels, consisting of the classes with the top-2 highest confidence scores.
Figure 2: Our candidate pseudolabel learning (CPL) significantly surpasses hard pseudolabel learning (Menghini et al., 2023) on the RESISC45 dataset in label estimation accuracy, defined as the rate at which the true label is included in the pseudolabels, leading to higher test accuracy.
Figure 3: Illustration of the training target generation process in our CPL method. At the beginning of each training iteration, we first construct a confidence score matrix composed of the confidence score vector $\boldsymbol{p}$ of each unlabeled instance. Then, candidate pseudolabels, derived from both intra- and inter-instance selection, are extracted to form the training target $\boldsymbol{s}$ for the subsequent model training.
Figure 4: Visualization of the average set size of candidate pseudolabels among all unlabeled data on six datasets under the UL setting of textual prompt tuning.
Figure 5: Visualization of the performance improvement of CLIP with CPL (with only two labeled instances per class) and with fully supervised few-shot learning when textual prompt tuning is applied. The $x$-axis in blue represents the number of labeled instances, while the $x$-axis in red represents the proportion of the unlabeled dataset. Both lines originate from the zero-shot performance of CLIP.
...and 6 more figures
