A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi
TL;DR
This work investigates how well vision-language models ground phrases to image regions and how that grounding relates to task success. It jointly evaluates task performance and phrase grounding on three benchmarks (Touchdown SDR, KiloGram, and Flickr30k Entities) with two models (ViLT-Aligner and MDETR), using correlation metrics to quantify how closely the two align. The key finding is that strong task performance guarantees neither robust grounding nor high task-grounding correlation. Grounding pre-training and, critically, dataset-specific grounding annotations can dramatically improve both grounding quality and its correlation with task success, sometimes with only a small fraction of the grounding data. These results suggest that grounding abilities are learnable through targeted data and training regimes. The work also provides datasets, models, and protocols for studying and improving the reasoning processes of vision-language systems in ways that benefit generalization and interpretability.
Abstract
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, along with three benchmarks to study the relation between the two. Our results show that contemporary models are inconsistent in their ability to ground phrases and to solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.
