Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez

TL;DR

This paper tackles visual grounding under limited annotated data by aligning gradient-based explanations with human-annotated regions via a margin-based Attention Mask Consistency (AMC) objective. Building on vision-language transformers such as ALBEF and on GradCAM explanations, AMC optimizes two heatmap-based margin terms, L_mean and L_max, that push explanation energy inside annotated regions while remaining flexible enough for complex region shapes. Empirically, AMC achieves state-of-the-art pointing-game accuracy on Flickr30k (86.49%) and strong results on RefCOCO+ (80.34% on the easy test split and 64.55% on the difficult one), outperforming previous methods that use vision-language models to score the outputs of object detectors. The approach is simple to implement, compatible with existing VLMs, and adaptable to various kinds of region annotations.
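To make the two margin terms more concrete, here is a minimal sketch of what L_mean and L_max could look like for a single heatmap and a single binary region mask. The summary above does not give the exact formulas, so the hinge form, the function name `amc_margin_losses`, and the margin values `delta_mean` and `delta_max` are all assumptions made for illustration. Plain Python lists stand in for tensors; an actual training setup would compute the same quantities in an autodiff framework so that the loss can be backpropagated through the GradCAM heatmap.

```python
from typing import Sequence, Tuple

Grid = Sequence[Sequence[float]]


def amc_margin_losses(
    heatmap: Grid,
    mask: Grid,
    delta_mean: float = 0.5,  # hypothetical margin, not taken from the paper
    delta_max: float = 0.5,   # hypothetical margin, not taken from the paper
) -> Tuple[float, float]:
    """Return (l_mean, l_max) for one heatmap and one binary region mask.

    Assumed hinge form: each term is zero once the inside statistic
    exceeds the outside statistic by at least the margin.
    """
    # Split heatmap values by whether the mask marks the cell as annotated.
    inside = [h for h_row, m_row in zip(heatmap, mask)
              for h, m in zip(h_row, m_row) if m]
    outside = [h for h_row, m_row in zip(heatmap, mask)
               for h, m in zip(h_row, m_row) if not m]
    if not inside or not outside:
        raise ValueError("mask must mark some, but not all, heatmap cells")

    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)

    # Mean term: average energy inside should beat average energy outside.
    l_mean = max(0.0, mean_out - mean_in + delta_mean)
    # Max term: the strongest response should lie inside the region.
    l_max = max(0.0, max(outside) - max(inside) + delta_max)
    return l_mean, l_max


if __name__ == "__main__":
    heatmap = [[0.9, 0.1],
               [0.8, 0.0]]
    good_mask = [[1, 0],
                 [1, 0]]  # annotation covers the high-energy column
    bad_mask = [[0, 1],
                [0, 1]]   # annotation covers the low-energy column
    print(amc_margin_losses(heatmap, good_mask))
    print(amc_margin_losses(heatmap, bad_mask))
```

In this sketch both terms vanish when the heatmap already concentrates inside the annotated region and grow when the energy sits outside it. Because neither term requires the heatmap to fill the mask uniformly, a formulation like this is compatible with irregular region shapes.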

Abstract

We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively small grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces better visual grounding results than previous methods that rely on vision-language models to score the outputs of object detectors. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% over the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension, where it obtains 80.34% accuracy on the easy test split of RefCOCO+ and 64.55% on the difficult split. AMC is effective, easy to implement, and general: it can be adopted by any vision-language model and can use any type of region annotation.
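The abstract reports accuracy without spelling out the metric, while the TL;DR describes it as pointing-game accuracy. The sketch below assumes the common pointing-game convention: a prediction counts as a hit when the heatmap's highest-scoring location falls inside the ground-truth box. The function names, the `(x_min, y_min, x_max, y_max)` box layout with inclusive bounds, and the tie-breaking rule are illustrative assumptions rather than the paper's evaluation code.

```python
from typing import Sequence, Tuple

Grid = Sequence[Sequence[float]]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def pointing_game_hit(heatmap: Grid, box: Box) -> bool:
    """True if the heatmap's peak location lies inside the box."""
    best_val = float("-inf")
    best_x, best_y = 0, 0
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            # Strict '>' keeps the first peak when several cells tie.
            if val > best_val:
                best_val, best_x, best_y = val, x, y
    x_min, y_min, x_max, y_max = box
    return x_min <= best_x <= x_max and y_min <= best_y <= y_max


def pointing_game_accuracy(heatmaps: Sequence[Grid],
                           boxes: Sequence[Box]) -> float:
    """Fraction of (heatmap, box) pairs whose peak falls inside the box."""
    if len(heatmaps) != len(boxes) or not heatmaps:
        raise ValueError("need equally many heatmaps and boxes, at least one")
    hits = sum(pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes))
    return hits / len(heatmaps)


if __name__ == "__main__":
    heatmap = [[0.1, 0.2],
               [0.9, 0.3]]  # peak at x=0, y=1
    print(pointing_game_accuracy([heatmap, heatmap],
                                 [(0, 1, 0, 1), (1, 0, 1, 0)]))
```

Under this convention only the location of the peak matters, not how the remaining heatmap energy is distributed.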

Paper Structure (18 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Gradient-based methods can generate heatmaps that explain the match between images and text for a Vision-language model (VLM). Our work aims to improve their ability to produce visual groundings by directly optimizing their gradient-based explanations so that they are consistent with human annotations provided for a reduced set of images.
  • Figure 2: Overview of our method. Among other objectives, standard vision-language models are trained to produce a matching score $y$ given an input image-text pair $(V,T)$. Some inputs carry an extra level of supervision in the form of region annotations, i.e. a triplet $(V,T,M)$, where $M$ is a binary mask indicating the regions annotated by a human. For these inputs, we optimize the model's GradCAM (Selvaraju et al., 2017) gradient-based explanations with $\mathcal{L}_{\text{amc}}$ so that they are consistent with the region annotations, maximizing the heatmap energy that falls inside the annotated region and minimizing the energy that falls outside. We accomplish this through soft margin losses as described in Sec. \ref{subsec:amc}.
  • Figure 3: Qualitative comparison of the generated explanations for various images and input phrases. First column: original images from Flickr30k Entities; in each colored area from left to right: bounding boxes selected by VMRM; heatmaps generated by gALBEF; heatmaps generated by our method. On the top of each group of images, we show the caption and target phrases.
  • Figure 4: We show some constructed textual descriptions with colored attributes and spatial references.
  • Figure 5: We show more qualitative examples for the RefCOCO+ testing set. Ground-truth boxes are marked in red. Below each image we provide one input phrase.
  • ...and 1 more figures