Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Ziyan Yang, Kushal Kafle, Franck Dernoncourt, Vicente Ordonez
TL;DR
This paper tackles visual grounding with limited region-level annotations by aligning gradient-based explanations with human-annotated regions through a margin-based Attention Mask Consistency (AMC) objective. Building on vision-language transformers such as ALBEF and on GradCAM explanation heatmaps, AMC optimizes two heatmap-based margin terms, L_mean and L_max, that push explanation energy inside annotated regions while remaining flexible enough for complex region shapes. Empirically, AMC achieves state-of-the-art pointing-game accuracy on Flickr30k (86.49%) and strong results on RefCOCO+ (80.34% on the easy test split, 64.55% on the difficult one), outperforming methods that use vision-language models to score object-detector outputs. The approach is simple to implement, compatible with existing vision-language models, and works with any type of region annotation.
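To make the two margin terms concrete, here is a minimal NumPy sketch of a loss in the spirit of L_mean and L_max. The summary above does not give the exact formulas, so the hinge form, the margin values (`delta_mean`, `delta_max`), the weights (`w_mean`, `w_max`), and the function name `amc_loss` are all assumptions made for illustration; a real training setup would compute this on differentiable GradCAM maps inside an autograd framework.

```python
import numpy as np

def amc_loss(heatmap, mask, delta_mean=0.5, delta_max=0.5, w_mean=1.0, w_max=1.0):
    """Illustrative margin loss comparing heatmap energy inside vs. outside a region.

    heatmap: (H, W) non-negative explanation map (e.g. a GradCAM heatmap).
    mask:    (H, W) binary array, 1 inside the annotated region, 0 outside.
    The hinge form and default margins are assumptions, not the paper's values.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any() or mask.all():
        raise ValueError("mask must contain both inside and outside pixels")

    inside, outside = heatmap[mask], heatmap[~mask]

    # Mean term: average energy inside should exceed average energy outside
    # by at least delta_mean.
    l_mean = max(0.0, outside.mean() - inside.mean() + delta_mean)

    # Max term: the strongest activation inside should exceed the strongest
    # activation outside by at least delta_max.
    l_max = max(0.0, outside.max() - inside.max() + delta_max)

    return w_mean * l_mean + w_max * l_max


# Toy example: a 4x4 map whose annotated region is the top-left 2x2 block.
mask = np.zeros((4, 4), dtype=int)
mask[:2, :2] = 1
good = mask.astype(float)        # all energy inside the region
bad = 1.0 - mask.astype(float)   # all energy outside the region
loss_good = amc_loss(good, mask)
loss_bad = amc_loss(bad, mask)
```

In this toy setup the loss vanishes when the heatmap is concentrated inside the mask and is positive when it is concentrated outside. Because the terms only compare aggregate statistics inside and outside the mask, the mask can be a box, a polygon, or a free-form segmentation.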
Abstract
We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively small grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces better visual grounding results than previous methods that rely on vision-language models to score the outputs of object detectors. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% over the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension, where it obtains 80.34% accuracy on the easy test split of RefCOCO+ and 64.55% on the difficult split. AMC is effective, easy to implement, and general: it can be adopted by any vision-language model and can use any type of region annotation.
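For readers unfamiliar with the pointing-game metric used for the Flickr30k results, the sketch below shows the usual idea: a prediction counts as a hit when the heatmap's maximum falls inside the ground-truth region. The function names, the single-box-per-phrase simplification, and the inclusive `(x0, y0, x1, y1)` box format are assumptions for illustration and may differ from the authors' evaluation code.

```python
import numpy as np

def pointing_game_hit(heatmap, box):
    """Return True if the heatmap's peak lies inside the box.

    heatmap: (H, W) array of explanation scores.
    box:     (x0, y0, x1, y1) in pixel coordinates, inclusive (assumed format).
    """
    heatmap = np.asarray(heatmap, dtype=float)
    # argmax works on the flattened array; unravel it back to (row, col).
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)

def pointing_game_accuracy(heatmaps, boxes):
    """Fraction of (heatmap, box) pairs whose peak lands inside the box."""
    hits = [pointing_game_hit(h, b) for h, b in zip(heatmaps, boxes)]
    return sum(hits) / len(hits)


# Toy example: one 4x4 heatmap with its peak at row 1, column 2.
hm = np.zeros((4, 4))
hm[1, 2] = 1.0
acc = pointing_game_accuracy([hm, hm], [(2, 1, 3, 2), (0, 0, 1, 1)])
```

Here the first box contains the peak and the second does not, so the toy accuracy is one hit out of two.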
