SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models

Kevin Miller, Samarth Mishra, Aditya Gangrade, Kate Saenko, Venkatesh Saligrama

TL;DR

SPARC introduces a zero-shot, training-free solution for multi-label recognition with Vision-Language Models by querying compound prompts and applying debiasing plus adaptive fusion of scores. It reveals that VLMs often exhibit OR-like behavior with a small AND bonus in compound prompts, motivating normalization and a rank-variance-based fusion to extract robust signals, including a key insight that the second-highest compound score is often more discriminative than the maximum. The method demonstrates strong mAP gains across COCO, VOC, and NUS-WIDE and across nine CLIP backbones, while remaining complementary to other zero-shot techniques. A theory-backed treatment explains the weakened-max phenomenon and provides conditions under which second-max fusion is advantageous, underscoring SPARC’s broader implications for score-based decoding in black-box VLMs and its practical impact as a plug-and-play, training-free enhancement.

Abstract

Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Existing approaches require prompt tuning or architectural adaptations, limiting zero-shot applicability. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Using large language model insights on object co-occurrence, we introduce compound prompts grounded in realistic object combinations. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum compound scores are surprisingly suboptimal compared to second-highest scores. We address these through a debiasing and score-fusion algorithm that corrects image bias and clarifies VLM response behaviors. Our method enhances other zero-shot approaches, consistently improving their results. Experiments show superior mean Average Precision (mAP) compared to methods requiring training data, achieved through refined object ranking for robust zero-shot MLR.
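The core idea, debiasing the scores and then fusing a class's singleton score with the second-highest of its compound-prompt scores, can be sketched as follows. This is a minimal illustration, not the SPARC algorithm itself: per-image standardization as the debiasing step, the fixed `weight`, and the function names `debias`, `second_highest`, and `fuse` are assumptions made for the example, and the adaptive fusion is replaced here by a fixed weighted average. The toy scores are invented.

```python
from statistics import mean, pstdev


def debias(scores):
    """Standardize one image's prompt scores to zero mean and unit variance.

    Assumption: per-image standardization stands in for the debiasing step.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def second_highest(scores):
    """Return the second-largest value (the second order statistic from the top)."""
    if len(scores) < 2:
        raise ValueError("need at least two compound-prompt scores")
    return sorted(scores, reverse=True)[1]


def fuse(singleton_score, compound_scores, weight=0.5):
    """Fixed-weight fusion of a singleton score with the second-highest compound score.

    Assumption: a fixed weight replaces the adaptive fusion described in the text.
    """
    return weight * singleton_score + (1 - weight) * second_highest(compound_scores)


# Toy scores for the class "cat" (invented values).
# Image without a cat: one compound prompt scores high because its other object is present.
absent_compound = [0.9, 0.3, 0.2]
# Image with a cat: every compound prompt mentioning "cat" scores high.
present_compound = [0.9, 0.85, 0.8]

max_absent, max_present = max(absent_compound), max(present_compound)
fused_absent = fuse(0.4, absent_compound)
fused_present = fuse(0.6, present_compound)
```

In this toy case the maximum compound score is identical for both images, so it cannot separate them, while the fused score built on the second-highest value is clearly larger for the image that contains the class.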

Paper Structure

This paper contains 33 sections, 8 theorems, 37 equations, 7 figures, 11 tables, 2 algorithms.

Key Result

Theorem 1

Given the assumptions above, plus the additional assumption that $\textrm{Pr}(\tilde{y}_0^{+} \neq y_0^{+} \lor \tilde{y}_0^{-} \neq y_0^{-}) > 0$, we can guarantee that $\textrm{Pr}(W_2) > \textrm{Pr}(W_1)$ for sufficiently large $m$.

Figures (7)

  • Figure 1: (top) Vision-Language Models (VLMs) like CLIP can be used for zero-shot classification with image-text similarity scores. While this works fairly well for single-class labels, they can struggle in the multi-label scenario. (bottom) In this paper, we introduce SPARC, our solution that functions on top of an existing VLM, treating it simply as a black-box score generator. Using class names, SPARC first creates compound prompts for additional queries to the VLM. It then debiases, ranks, and appropriately fuses the resulting scores to generate final scores for the original classes.
  • Figure 2: A motivating example with an image where class "cat" is absent (left) and one where it is present (right). The highest compound prompt score is an unhelpful signal because it gives a high score to both negatives and positives, while the second-highest is more discriminative. Our method adaptively fuses the most informative order statistics, resulting in a strong signal.
  • Figure 3: Per-class APs (averaged over all CLIP backbones) for our method vs. vanilla ZSCLIP on the COCO dataset. Our method improves over ZSCLIP for almost every class. Plots for VOC and NUS-WIDE are shown in the Supplementary.
  • Figure 4: Average mAP for different rank fusion strategies, without (top) and with (bottom) the "merge" step (\ref{eq:merge}). Adaptive fusion outperforms the fixed strategies in both settings.
  • Figure 5: Histograms for singleton, 1st max, and 2nd max scores for "cat" in the COCO dataset. We see that 1st max creates overlap by lifting the scores of some ground-truth negatives. 2nd max does not create these issues and performs well when fused with singleton scores.
  • ...and 2 more figures
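
The per-class AP and mAP values reported in Figures 3 and 4 are ranking metrics. The sketch below shows the standard uninterpolated definition of average precision; whether the paper's evaluation uses exactly this variant is an assumption, and the function names are illustrative.

```python
def average_precision(scores, labels):
    """Uninterpolated AP for one class: mean of precision@k over the ranks of positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    if hits == 0:
        raise ValueError("class has no positive images")
    return precision_sum / hits


def mean_average_precision(scores_by_class, labels_by_class):
    """mAP: unweighted mean of the per-class APs."""
    aps = [
        average_precision(scores, labels)
        for scores, labels in zip(scores_by_class, labels_by_class)
    ]
    return sum(aps) / len(aps)


# Toy example with two classes (invented scores, one list entry per image).
ap_mixed = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
ap_perfect = average_precision([0.9, 0.8], [1, 0])
toy_map = mean_average_precision(
    [[0.9, 0.8, 0.7, 0.6], [0.9, 0.8]],
    [[1, 0, 1, 0], [1, 0]],
)
```

Because AP depends only on the ordering of images within a class, any monotone rescaling of a class's scores leaves it unchanged, which is why the quality of the ranking is what matters for mAP.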

Theorems & Definitions (16)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 6 more