Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Zhenlin Xu; Yi Zhu; Tiffany Deng; Abhay Mittal; Yanbei Chen; Manchen Wang; Paolo Favaro; Joseph Tighe; Davide Modolo

TL;DR

This work studies open-world zero-shot recognition with vision-language models (VLMs), focusing on two core challenges: semantic granularity and text specificity. It introduces two benchmarks to diagnose where current VLMs, particularly CLIP-family architectures, fall short: one measures granularity consistency across a semantic hierarchy, and the other measures specificity robustness through image-to-text retrieval with hard positives and hard negatives. The study finds that models favor moderately fine-grained concepts, struggle with coarse-grained generalization, and produce similarity scores that can be misaligned with textual correctness; fine-tuning with hard samples yields limited, task-specific gains. The authors propose directions for improvement, including more balanced training data distributions, advanced cross-modality fusion strategies, and leveraging large language models to broaden generalization, supported by a two-level granularity benchmark and a language-only analysis to guide future research in robust open-world recognition. The cross-modality score is analyzed as $f(x_v, x_t) = E_v(x_v) \odot E_t(x_t)$, with propagation schemes $S^{\text{child}}$ and $S^{\text{leaf}}$ used to bridge coarse-grained (CG) and fine-grained (FG) concepts, and evaluation via $\text{mAP}$ on hierarchical labels; specificity is assessed with $\text{AP}$/$\text{mAP}$ on MSCOCO with challenging prompts. The findings have practical implications for deploying VLMs in real-world scenarios, where accurate alignment and generalization across varied linguistic expressions are critical for reliable open-world perception.
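As a rough illustration of the scoring and evaluation mentioned above, the sketch below computes a two-stream cross-modality score and an average precision (AP) value. It rests on assumptions not stated on this page: the operator combining $E_v(x_v)$ and $E_t(x_t)$ is taken to be cosine similarity, as is common for CLIP-style models, and AP is the standard non-interpolated definition. The function names and toy inputs are hypothetical.

```python
import numpy as np


def cross_modality_score(image_emb, text_emb):
    """Cosine similarity between each image embedding and each text embedding.

    Assumption: the score is the dot product of L2-normalized embeddings.
    image_emb has shape (n_images, d); text_emb has shape (n_texts, d).
    """
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return img @ txt.T


def average_precision(scores, labels):
    """Non-interpolated AP for one query (assumed definition).

    scores: 1-D array of similarity scores; labels: 1 for positive, 0 for negative.
    """
    order = np.argsort(-np.asarray(scores))       # rank candidates by descending score
    hits = np.asarray(labels)[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_average_precision(score_rows, label_rows):
    """Mean of per-query (or per-class) AP values."""
    return float(np.mean([average_precision(s, l) for s, l in zip(score_rows, label_rows)]))
```

With this definition, a correct text that scores below an incorrect one lowers AP, which is how misaligned similarity scores show up in a retrieval-style evaluation.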

Abstract

This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn't fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.

Paper Structure (20 sections, 1 equation, 5 figures, 6 tables)

Introduction
Related Works
Zero-shot visual recognition
Benchmarking vision-language models
Zero-shot Visual Recognition With Vision-Language Models
Granularity Consistency of Vision-Language Models
Measure performance discrepancy on a semantic hierarchy
Dataset
Evaluation Protocol
Results and Analysis
Evaluate Specificity Robustness
Evaluation protocol and dataset
Results and implications
Limitations of Fine-Tuning VLMs with Hard Samples
Conclusion and Discussion
...and 5 more sections

Figures (5)

Figure 1: Left: Zero-shot models should recognize images with fine-grained (FG) concepts such as "Leopard", as well as coarse-grained (CG) concepts like "Feline". However, they often exhibit performance discrepancies on concepts at different levels of granularity. Right: Zero-shot models should recognize whether the text correctly describes the given image. However, vision-language models can be sensitive to the specificity of the text and struggle to distinguish between challenging positives, such as single-label prompts, and hard negatives, such as poisoned captions with small changes.
Figure 2: Illustration of the two ways to propagate scores on the semantic hierarchy. (a) Raw scores without propagation. (b) Propagate the max score from direct children classes. For example, 0.35 = max(0.17, 0.35). (c) Propagate the max score from leaf classes. For example, 0.48 = max(0.16, 0.10, 0.13, 0.48, 0.31).
Figure 3: Left: Box plot of zero-shot classification performance (mAP) for leaf classes over the level in the semantic hierarchy. Middle: Box plot of classification performance (mAP) for ancestor classes over the level in the semantic tree. Note that levels 0 and 1 have only 1 and 2 classes respectively, so a high mAP is easy to obtain. Right: Box plot of the improved zero-shot classification performance (mAP) for ancestor classes, obtained by propagating from leaf classes, over the level in the semantic tree.
Figure 4: Left: Scatter plot of the frequency of class names in pre-training captions over the level in the semantic tree. Coarse-grained and overly fine-grained concepts are less represented in captions. Right: Scatter plot of performance discrepancy over the frequency gap between ancestor class names and their leaf children. A positive correlation exists between the performance discrepancy and the frequency gap (coefficient 0.43 with p-value 3.4e-39).
Figure 5: Left: Distribution of cross-modality scores with positive text: COCO captions, Localized-narratives captions, single-label, and multi-label prompts. Mismatched specificity in text (either too low or high) results in reduced scores. Middle: Distribution of scores with negative text: captions from random images, relevant images, and subtly altered captions. The altered captions attain high scores, similar to positive texts. Right: Score differences between single-label prompts and various negative texts, highlighting that correct single-label prompts often score lower than incorrect altered captions.
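
The two propagation schemes in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tree and class names are hypothetical, the score values are reused from the caption with an arrangement chosen so that both caption examples are reproduced, and the max is taken over descendants only, since the caption does not say whether a node's own raw score also enters the max.

```python
# Hypothetical toy hierarchy; only the numeric values come from the Figure 2 caption.
CHILDREN = {
    "feline": ["small_cat", "big_cat"],
    "small_cat": ["tabby", "siamese", "persian"],
    "big_cat": ["leopard", "lion"],
}

# Raw (unpropagated) cross-modality scores for one image.
RAW = {
    "feline": 0.20,  # assumed value, not from the caption
    "small_cat": 0.17, "big_cat": 0.35,
    "tabby": 0.16, "siamese": 0.10, "persian": 0.13,
    "leopard": 0.48, "lion": 0.31,
}


def leaves(node):
    """Return all leaf classes under `node` (the node itself if it is a leaf)."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return [node]
    return [leaf for kid in kids for leaf in leaves(kid)]


def child_score(node):
    """Child propagation: max raw score over direct children; leaves keep their raw score."""
    kids = CHILDREN.get(node, [])
    return max(RAW[k] for k in kids) if kids else RAW[node]


def leaf_score(node):
    """Leaf propagation: max raw score over all leaf descendants."""
    return max(RAW[leaf] for leaf in leaves(node))


print(child_score("feline"))  # max(0.17, 0.35)
print(leaf_score("feline"))   # max(0.16, 0.10, 0.13, 0.48, 0.31)
```

In this toy case the raw score for "feline" (0.20) is lower than both propagated scores, which mirrors the reported pattern that ancestor classes benefit from scores propagated from their leaf classes.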
