Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

Cristina Menghini; Andrew Delworth; Stephen H. Bach

TL;DR

Previously unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy. Unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy.

Abstract

Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then generates labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data. By using zero-shot pseudolabels as a source of supervision, we observe that learning paradigms such as semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function. This unified view enables the development of versatile training strategies that are applicable across learning paradigms. We investigate them on image classification tasks where CLIP exhibits limitations, by varying prompt modalities, e.g., textual or visual prompts, and learning paradigms. We find that (1) unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, by 28.4 points in transductive zero-shot learning, and by 15.2 points in unsupervised learning, and (2) unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy. The code to reproduce the experiments is at https://github.com/BatsResearch/menghini-neurips23-code.
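
To make the idea of zero-shot pseudolabels and a shared loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked above for that). The function names, the top-k-per-class selection rule, the `weight` parameter, and the toy similarity scores are all assumptions chosen for illustration. The sketch assumes we already have image-text similarity scores between each unlabeled image and one text prompt per class.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_pseudolabels(similarities, k):
    """Pick, for each class, the k unlabeled images with the highest
    zero-shot probability for that class (an assumed selection rule).

    similarities[i][c] is the similarity between image i and the text
    prompt of class c. Returns {class index: [image indices]}.
    """
    probs = [softmax(row) for row in similarities]
    num_classes = len(similarities[0])
    selected = {}
    for c in range(num_classes):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i][c], reverse=True)
        selected[c] = ranked[:k]
    return selected


def mean_cross_entropy(probs, labels):
    """Average negative log-likelihood; 0.0 when the set is empty."""
    if not labels:
        return 0.0
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def combined_loss(labeled_probs, labels, pseudo_probs, pseudolabels, weight=1.0):
    """Labeled term plus weighted pseudolabeled term (hypothetical form).

    With an empty labeled set this reduces to the pseudolabel term alone,
    which is how a single loss can cover the unsupervised setting too.
    """
    return (mean_cross_entropy(labeled_probs, labels)
            + weight * mean_cross_entropy(pseudo_probs, pseudolabels))


# Toy similarity scores for three unlabeled images and two classes.
toy_similarities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
toy_selection = zero_shot_pseudolabels(toy_similarities, k=1)
```

In this toy example, `toy_selection` assigns image 0 to class 0 and image 1 to class 1. Because selection is done per class, a larger `k` can place the same image in more than one class, so a real pipeline would need a rule for resolving such conflicts.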

Paper Structure (60 sections, 4 equations, 9 figures, 10 tables)

Introduction
Background and related work
Vision-language models
Prompt tuning
Learning from pseudolabels
Design space
Pseudolabeling scheme
Unified objective function
Prompt modalities
Learning paradigms
Semi-supervised learning
Transductive zero-shot learning
Unsupervised learning
Supervised learning
Training strategies
...and 45 more sections

Figures (9)

Figure 1: Our design space to explore the effect of leveraging pseudolabels in a unified way across prompt modalities, learning paradigms, and training strategies. The green (dashed) path has already been explored by Huang et al. (2022), while the red (solid) lines are the unexplored combinations for prompt tuning.
Figure 2: Balance of seen and unseen accuracies vs. the model's overall accuracy. Points close to 0 indicate a good balance. Negative values indicate better accuracy for the seen classes.
Figure 3: Evolution of pseudolabel accuracy during training. The rows correspond to SSL, UL, and TRZSL, in that order. IFPL uses the top x-axis, while CLIP and GRIP use the bottom one.
Figure 4: Improvements of FPL and GRIP over CLIP's per-class accuracies (RESICS45). The x-axis is the ranked class index, while the y-axis is the accuracy.
Figure 5: For each dataset we show the distribution of CLIP's predictions over classes on the test set. The blue dots represent the true class distribution.
...and 4 more figures
