Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

Zhifang Zhang, Yuwei Niu, Xin Liu, Beibei Li

TL;DR

This paper addresses tuning vision-language models when only candidate labels are available, a setting motivated by privacy and labeling constraints. It first shows that vanilla prompt learning combined with partial-label learning (PLL) objectives can learn from candidate labels, but its performance degrades as label ambiguity grows. The authors propose a prompt-alignment framework that dynamically mixes the class posteriors predicted by the handcrafted and learnable prompts and aligns the model output with this mixture, improving robustness across PLL objectives while keeping all parameters except the learnable prompt frozen. Extensive experiments across eight datasets and multiple PLL methods demonstrate substantial gains over vanilla prompt learning and competitive performance relative to fully supervised or zero-shot baselines, highlighting the practical impact for privacy-aware labeling scenarios.

Abstract

Vision-language models (VLMs) can learn high-quality representations from a large-scale training dataset of image-text pairs. Prompt learning is a popular approach to fine-tuning VLMs to adapt them to downstream tasks. Despite its satisfying performance, a major limitation of prompt learning is its demand for labelled data. In real-world scenarios, we may only obtain candidate labels (a set that includes the true label) instead of the true labels due to data privacy or sensitivity issues. In this paper, we provide the first study on prompt learning with candidate labels for VLMs. We empirically demonstrate that prompt learning is more advantageous than other fine-tuning methods for handling candidate labels. Nonetheless, its performance drops as the label ambiguity increases. To improve its robustness, we propose a simple yet effective framework that better leverages the prior knowledge of VLMs to guide the learning process with candidate labels. Specifically, our framework disambiguates candidate labels by aligning the model output with the mixed class posterior jointly predicted by both the learnable and the handcrafted prompt. Moreover, our framework can be equipped with various off-the-shelf training objectives for learning with candidate labels to further improve their performance. Extensive experiments demonstrate the effectiveness of our proposed framework.
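
The alignment step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the fixed mixing weight `alpha` (the paper mixes dynamically), the renormalisation over the candidate set used to "recalculate" the posteriors, and the plain soft-target cross-entropy used in place of the paper's re-weighted loss are all assumptions made for illustration, as are the function names.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def restrict_to_candidates(probs, candidate_mask):
    """Zero out non-candidate classes and renormalise.
    Assumption: this is one plausible reading of 'recalculated' posteriors."""
    masked = probs * candidate_mask
    return masked / masked.sum(axis=-1, keepdims=True)

def alignment_loss(learnable_logits, handcrafted_logits, candidate_mask, alpha=0.5):
    """Cross-entropy between a mixed class posterior and the learnable-prompt output.
    alpha is a hypothetical fixed mixing weight, not a value from the paper."""
    p_learn = softmax(learnable_logits)
    p_hand = softmax(handcrafted_logits)
    q_learn = restrict_to_candidates(p_learn, candidate_mask)
    q_hand = restrict_to_candidates(p_hand, candidate_mask)
    # Mixed posterior; in a real training loop it would be treated as a constant target.
    target = alpha * q_learn + (1.0 - alpha) * q_hand
    log_p = np.log(p_learn + 1e-12)
    loss = -(target * log_p).sum(axis=-1).mean()
    return float(loss), target

# Toy batch: 2 images, 3 classes; each row of the mask marks the candidate labels.
learnable_logits = np.array([[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]])
handcrafted_logits = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 2.0]])
candidate_mask = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])

loss, target = alignment_loss(learnable_logits, handcrafted_logits, candidate_mask)
print("alignment loss:", loss)
print("mixed target:\n", target)
```

In this sketch the mixed target places no mass outside the candidate set, so the handcrafted prompt can only shift probability among the labels that the annotation allows.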

Paper Structure (13 sections, 9 equations, 3 figures, 8 tables)

This paper contains 13 sections, 9 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Possible candidate labels for an image of a falcon. In the right portion of the figure, candidate labels are colored in blue. Within the candidate label set, the hidden true label is underlined to distinguish it from the false-positive labels.
  • Figure 2: Illustration of our framework, which consists of a prompt alignment module and a PLL method module. In the prompt alignment module, two prompts, the handcrafted prompt and the learnable prompt, are fed to the text encoder. The recalculated class posteriors yielded by the two prompts are then mixed, and the mixed class posterior is aligned with the output of the learnable prompt using a re-weighted cross-entropy loss. The PLL method module can be any prevailing PLL method, combined with the prompt alignment module. During fine-tuning, all parameters of the framework are frozen except for the learnable prompt.
  • Figure 3: Performance of multiple fine-tuning approaches combined with PiCO (Wang et al., 2021) in a vanilla way on UCF101 (Soomro et al., 2012) and Caltech101 (Fei-Fei et al., 2004), with candidate labels of increasing label ambiguity. vPLL denotes a simple baseline that treats every candidate label as the ground-truth label and trains with the cross-entropy loss.
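
The vPLL baseline mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration rather than the paper's code: averaging the per-candidate cross-entropy terms (instead of summing them) is an assumption, and `vpll_loss` is a hypothetical name.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def vpll_loss(logits, candidate_mask):
    """Treat every candidate label as if it were the ground truth.
    Assumption: the cross-entropy terms are averaged over each candidate set."""
    log_p = log_softmax(logits)
    per_example = -(candidate_mask * log_p).sum(axis=-1) / candidate_mask.sum(axis=-1)
    return float(per_example.mean())

logits = np.array([[2.0, 1.0, 0.0]])
single_label = np.array([[1.0, 0.0, 0.0]])   # no ambiguity: ordinary cross-entropy
two_candidates = np.array([[1.0, 1.0, 0.0]])  # the true label plus one false positive

loss_single = vpll_loss(logits, single_label)
loss_ambiguous = vpll_loss(logits, two_candidates)
print(loss_single, loss_ambiguous)
```

With a single candidate the loss reduces to ordinary cross-entropy. With several candidates the model is pushed toward every candidate equally, which is why this baseline does nothing to disambiguate the true label.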