Rethinking Prompting Strategies for Multi-Label Recognition with Partial Annotations

Samyak Rawlekar, Shubhang Bhatnagar, Narendra Ahuja

TL;DR

This work interrogates prompting strategies for multi-label recognition with partial annotations by separating the contributions of positive and negative prompts. It introduces PositiveCoOp and NegativeCoOp to isolate the impact of CLIP-guided prompts, finding that negative prompts degrade performance while positive prompts paired with learned negative embeddings yield the best results. A vision-only baseline demonstrates strong performance with substantially lower computational cost, especially when label availability is high. The analysis suggests that absence information is underrepresented in training data like LAION-400M, which explains why CLIP-guided negative prompts provide limited benefit and highlights the practical value of prompt-free or positive-prompt-based approaches for efficient MLR.

Abstract

Vision-language models (VLMs) like CLIP have been adapted for Multi-Label Recognition (MLR) with partial annotations by leveraging prompt-learning, where positive and negative prompts are learned for each class to associate their embeddings with class presence or absence in the shared vision-text feature space. While this approach improves MLR performance by relying on VLM priors, we hypothesize that learning negative prompts may be suboptimal, as the datasets used to train VLMs lack image-caption pairs explicitly focusing on class absence. To analyze the impact of positive and negative prompt learning on MLR, we introduce PositiveCoOp and NegativeCoOp, where only one prompt is learned with VLM guidance while the other is replaced by an embedding vector learned directly in the shared feature space without relying on the text encoder. Through empirical analysis, we observe that negative prompts degrade MLR performance, and learning only positive prompts, combined with learned negative embeddings (PositiveCoOp), outperforms dual prompt learning approaches. Moreover, we quantify the performance benefits that prompt-learning offers over a simple vision-features-only baseline, observing that the baseline displays strong performance comparable to the dual prompt learning approach (DualCoOp) when the proportion of missing labels is low, while requiring half the training compute and 16 times fewer parameters.

Paper Structure (22 sections, 8 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Visualization of Similarity Maps. We compare similarity maps obtained using cosine similarity between image features and positive prompt features versus image features and negative prompt features for each class. The activation of similar regions in both maps questions the effectiveness of CLIP's guidance for learning a negative prompt.
  • Figure 2: Conceptual Comparison of MLR Approaches. In (a), we show the textual framework for existing VLM-based MLR approaches with partial annotations. They use CLIP's guidance to learn prompts for each class: a positive prompt associated with the presence of the class and a negative prompt associated with the absence of the class. To analyze the effect of the positive and negative guidance, we create two setups. In (b), we test the impact of positive guidance by removing the negative prompt and instead learning a negative embedding directly in feature space to detect class absence. The positive prompt, learned with CLIP, remains for detecting class presence. In (c), we test the impact of negative guidance by removing the positive prompt and instead learning a feature-space embedding to detect class presence. The negative prompt, learned with CLIP, is used to detect class absence.
  • Figure 2: Image-text pairs from the LAION-400M dataset. The descriptions of the images mainly focus on the objects (classes) present in the image and do not describe the absence of objects (classes).
  • Figure 3: Baseline Framework. To quantify the impact of prompting-based approaches in MLR with partial annotations, we set up a baseline (Sec. \ref{subsec:baseline}) that uses only visual information. Given an image $\mathbf{x}_{i}$ with multiple objects, we first extract its features ($G_{\text{img}}(\mathbf{x}_i)$) using the frozen visual encoder of CLIP \cite{clip}. These features are then passed through a linear projector layer ($\Phi$) that projects the $d$-dimensional features at location $(h,w)$ to two local logits per class for all $N$ classes, one logit indicating the presence of the class and the other its absence. The local logits are aggregated across all spatial regions to produce the final positive and negative logits. We train the linear projector layer of the baseline using the widely used asymmetric loss \cite{asl}.
  • Figure 4: PositiveCoOp and NegativeCoOp Overview. This figure illustrates the PositiveCoOp framework, with NegativeCoOp being its mirror image. VLM-based MLR approaches like DualCoOp \cite{dualcoop} learn both positive and negative prompts using CLIP's guidance: one for class presence and one for class absence. In PositiveCoOp (NegativeCoOp), for a given class $j$, only the positive (negative) prompt $\mathbf{t}_{j,+}$ ($\mathbf{t}_{j,-}$) is learned through CLIP, while the negative (positive) prompt is replaced by a learned text embedding $\mathbf{r}_{j,-}$ ($\mathbf{r}_{j,+}$) in the feature space, independent of CLIP's text encoder. For both PositiveCoOp and NegativeCoOp, we obtain the final predictions $\mathbf{\hat{p}_i^{j,+}}$ and $\mathbf{\hat{p}_i^{j,-}}$ by calculating the cosine similarity of the image features with the embedding of the positive text prompt $\mathbf{r}_{j,+}$ and the learned text embedding $\mathbf{r}_{j,-}$, and then aggregating these similarities using the class-specific feature aggregation strategy of \cite{dualcoop}, described in detail in Sec. \ref{subsec:baseline}. Only the text embeddings and the prompts are trained, using the widely used Asymmetric Loss \cite{asl}.
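
The vision-only baseline described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `baseline_logits` and `partial_asl`, the softmax-weighted spatial pooling, the softmax over each (presence, absence) logit pair, the label encoding (+1 present, -1 absent, 0 unknown), and the hyperparameter values are all hypothetical choices made for this example. Random arrays stand in for the frozen CLIP features $G_{\text{img}}(\mathbf{x}_i)$.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def baseline_logits(feats, weight, bias, num_classes):
    """Linear projector followed by spatial aggregation.

    feats:  (HW, d) frozen visual features, one row per spatial location.
    weight: (d, 2N) and bias: (2N,) are the only trainable parameters.
    """
    local = feats @ weight + bias                # (HW, 2N) local logits
    local = local.reshape(-1, num_classes, 2)    # [..., 0] presence, [..., 1] absence
    # Assumed aggregation: a per-class softmax over locations, computed from
    # the presence logits, weights both the presence and absence logits.
    attn = softmax(local[..., 0], axis=0)        # (HW, N)
    pos = (attn * local[..., 0]).sum(axis=0)     # (N,) final positive logits
    neg = (attn * local[..., 1]).sum(axis=0)     # (N,) final negative logits
    return pos, neg

def partial_asl(pos, neg, labels, gamma_pos=1.0, gamma_neg=2.0,
                margin=0.05, eps=1e-8):
    """Asymmetric loss that skips unknown labels (labels == 0)."""
    # Assumed: presence probability is a softmax over each logit pair.
    p = softmax(np.stack([pos, neg]), axis=0)[0]
    p_m = np.clip(p - margin, 0.0, None)         # margin-shifted probability
    loss_pos = -((1.0 - p) ** gamma_pos) * np.log(np.clip(p, eps, None))
    loss_neg = -(p_m ** gamma_neg) * np.log(np.clip(1.0 - p_m, eps, None))
    loss = (np.where(labels == 1, loss_pos, 0.0)
            + np.where(labels == -1, loss_neg, 0.0))
    return loss.sum()

# Toy usage: a 7x7 feature map, d = 16, N = 5 classes, two labels unknown.
rng = np.random.default_rng(0)
HW, d, N = 49, 16, 5
feats = rng.normal(size=(HW, d))                 # stand-in for frozen features
weight = rng.normal(size=(d, 2 * N)) * 0.1
bias = np.zeros(2 * N)
pos, neg = baseline_logits(feats, weight, bias, N)
labels = np.array([1, -1, 0, 0, 1])
loss = partial_asl(pos, neg, labels)
```

In this sketch the projector holds only `d * 2N + 2N` trainable values, which illustrates why a baseline of this kind is cheap to train. Classes with unknown labels contribute nothing to the loss, so partially annotated images can be used without imputing the missing labels.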