Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM

Navid Rajabi; Jana Kosecka

TL;DR

This work tackles the challenge of grounding in Vision-Language Models (VLMs) by identifying shortcomings of the Pointing Game as a grounding metric and introducing GradCAM-based, quantitative measures. It defines a family of metrics, including $IoU_{Soft}$, $Dice_{Soft}$, $WDP$, $IO_{ratio}$, and $PG_{Uncertainty}$, to rigorously quantify how well GradCAM activations align with ground-truth regions and to capture grounding uncertainty across models. Through evaluations of four state-of-the-art VLMs (BLIP$_{base}$, BLIP$_{large}$, CLIP$_{gScoreCAM}$, ALBEF$_{AMC}$) on in-distribution (ID) datasets (Flickr30K Entities, RefCOCO+) and an out-of-distribution (OOD) dataset (SpatialSense), the study finds that ALBEF$_{AMC}$ often achieves the best overall grounding performance, while CLIP-based methods may produce cleaner localizations but more spurious activations. The proposed metrics enable finer-grained, explainable grounding comparisons beyond binary accuracy and can guide the development of grounding-aware VLMs for phrase grounding, referring expressions, and spatial relation tasks in both ID and OOD contexts.
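To make the idea of a continuous overlap score concrete, the sketch below computes two widely used soft overlap measures between an activation map and a bounding-box mask. This is an illustration under assumptions, not the paper's definitions: the summary above does not give the formulas for $IoU_{Soft}$ or $Dice_{Soft}$, so the fuzzy-Jaccard (min/max) and soft-Dice (product) forms, the helper names `box_mask`, `soft_iou`, and `soft_dice`, and the inclusive `(x_min, y_min, x_max, y_max)` box convention are all choices made here for illustration. The activation map is assumed to be normalized to $[0, 1]$.

```python
import numpy as np

def box_mask(shape, box):
    """Binary mask for an inclusive (x_min, y_min, x_max, y_max) box (assumed convention)."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=float)
    mask[y0:y1 + 1, x0:x1 + 1] = 1.0
    return mask

def soft_iou(activation, mask, eps=1e-8):
    """Fuzzy Jaccard index: sum of element-wise minima over sum of maxima.

    Illustrative stand-in only; not necessarily the paper's IoU_Soft.
    """
    inter = np.minimum(activation, mask).sum()
    union = np.maximum(activation, mask).sum()
    return float(inter / (union + eps))

def soft_dice(activation, mask, eps=1e-8):
    """Soft Dice coefficient: 2 * sum(A * M) / (sum(A) + sum(M)).

    Illustrative stand-in only; not necessarily the paper's Dice_Soft.
    """
    inter = (activation * mask).sum()
    return float(2.0 * inter / (activation.sum() + mask.sum() + eps))

# Toy example: a 4x4 map with two active pixels inside the box and two outside.
mask = box_mask((4, 4), (0, 0, 1, 1))
act = np.zeros((4, 4))
act[0, 0] = act[0, 1] = 1.0   # inside the box
act[3, 2] = act[3, 3] = 1.0   # outside the box (spurious activation)
iou_value = soft_iou(act, mask)
dice_value = soft_dice(act, mask)
```

Unlike a binary hit-or-miss label, both scores drop as activation mass leaks outside the box, which is the kind of graded comparison the summary describes.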

Abstract

Vision and Language Models (VLMs) continue to demonstrate remarkable zero-shot (ZS) performance across various tasks. However, many probing studies have revealed that even the best-performing VLMs struggle to capture aspects of compositional scene understanding, lacking the ability to properly ground and localize linguistic phrases in images. Recent VLM advancements include scaling up both model and dataset sizes, additional training objectives and levels of supervision, and variations in the model architectures. To characterize the grounding ability of VLMs in tasks such as phrase grounding, referring expression comprehension, and relationship understanding, the Pointing Game has been used as an evaluation metric for datasets with bounding box annotations. In this paper, we introduce a novel suite of quantitative metrics that utilize GradCAM activations to rigorously evaluate the grounding capabilities of pre-trained VLMs like CLIP, BLIP, and ALBEF. These metrics offer an explainable and quantifiable approach for a more detailed comparison of the zero-shot capabilities of VLMs and enable measuring models' grounding uncertainty. This characterization reveals interesting tradeoffs between the size of the model, the dataset size, and their performance.
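For readers unfamiliar with the Pointing Game, the following is a minimal sketch of the standard formulation: a prediction counts as a hit when the maximum-activation point falls inside the ground-truth bounding box. The tie check is an addition made here to illustrate how several equally maximal activations can disagree about the binary label; it is not the paper's $PG_{Uncertainty}$ metric, whose definition is not given in this summary. The function name `pointing_game`, the tolerance `atol`, and the inclusive `(x_min, y_min, x_max, y_max)` box convention are assumptions.

```python
import numpy as np

def pointing_game(activation, box, atol=1e-6):
    """Return (hit, ambiguous) for one activation map and one box.

    hit:       True if the first arg-max pixel lies inside the box
               (the usual binary Pointing Game label).
    ambiguous: True if several pixels tie for the maximum and they
               disagree on being inside vs. outside the box
               (illustrative check, not the paper's metric).
    box is (x_min, y_min, x_max, y_max), inclusive (assumed convention).
    """
    x0, y0, x1, y1 = box

    # Classic Pointing Game: take a single arg-max location.
    py, px = np.unravel_index(np.argmax(activation), activation.shape)
    hit = bool(x0 <= px <= x1 and y0 <= py <= y1)

    # Collect every pixel that ties with the peak value.
    peak = activation.max()
    ys, xs = np.nonzero(np.isclose(activation, peak, atol=atol))
    inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
    ambiguous = bool(inside.any() and not inside.all())
    return hit, ambiguous

# Toy example: two peaks of value 1.0, one inside the box and one outside.
act = np.zeros((5, 5))
act[1, 1] = 1.0
act[4, 4] = 1.0
result_tied = pointing_game(act, (0, 0, 2, 2))

# Single unambiguous peak outside the box.
act_single = np.zeros((5, 5))
act_single[4, 4] = 1.0
result_single = pointing_game(act_single, (0, 0, 2, 2))
```

In the tied case the binary label depends only on which maximal pixel `np.argmax` happens to return first, which is why a single accuracy bit can hide this kind of instability.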

Paper Structure (8 sections, 5 equations, 14 figures, 1 table)


Introduction
Method
Experiments
Discussion
Conclusion
Experiments Details
Sample Qualitative Results
Histograms of Score Distributions

Figures (14)

Figure 1: Uncertainty in Pointing Game (PG) accuracy, when having multiple top-k identical activations with inconsistent PG binary labels (Scenario 1). As depicted in the bottom-right figure, three top high-confidence activations exist, each with a value of 1.0, after our NMS analysis. One falls outside the bounding box, one inside, and one at the border. In these cases, PG lacks any additional clues or heuristics to determine which one to select.
Figure 2: Given the "his daughter" prompt, PG returns the same accuracy of 1 for all four model outputs in a discrete manner, and overlooks the differences in their holistic grounding qualities (Scenario 2). On the other hand, our $\mathbf{IO}_{ratio}$ metric can differentiate and rank them in a more explainable and continuous manner by quantifying them each as a single normalized value between 0 and 1.
Figure 3: Histograms of $\mathrm{IoU}_{Soft}$ and $\mathrm{IO}_{ratio}$ distributions for ID vs. OOD. Note that the histograms are more peaked for in-distribution datasets, shown in blue on the left, and are shifted to the right for better-performing models. The out-of-distribution experiments for all models have less peaked, flatter histograms, shown in orange on the right. Full visualizations can be found in the appendix on histograms of score distributions.
Figure 4: A sample from SpatialSense NYU bedroom set. We consider the "wifi router to the right of television" prompt as Triplet, "wifi router" as Subject, and "television" as Object.
Figure 5: RefCOCO+ (testA) - two sample qualitative results.
...and 9 more figures
