A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Noriyuki Kojima, Hadar Averbuch-Elor, Yoav Artzi

TL;DR

This work investigates how well vision-language models ground phrases to image regions and how this grounding relates to task success. By introducing three benchmarks (Touchdown SDR, KiloGram, and Flickr30k Entities) and two models (ViLT-Aligner and MDETR), it jointly evaluates task performance and phrase grounding, and uses correlation metrics to quantify how well the two align. The key finding is that strong task performance guarantees neither robust grounding nor high task-grounding correlation. Grounding pre-training and, critically, dataset-specific grounding annotations can dramatically improve both grounding quality and its correlation with task success, sometimes with only a small fraction of the grounding data. These results suggest that grounding abilities are learnable through targeted data and training regimes. The work also provides datasets, models, and protocols for studying and improving the reasoning processes of vision-language systems in ways that benefit generalization and interpretability.

Abstract

Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even though it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models are inconsistent in their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code is available at https://github.com/lil-lab/phrase_grounding.
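
To make the idea of jointly measuring task performance and phrase grounding concrete, the following is a minimal sketch rather than the authors' protocol. It assumes, purely for illustration, that grounding quality is scored by bounding-box intersection-over-union (IoU) and that task-grounding correlation is a Pearson correlation between per-example task success (0 or 1) and a per-example grounding score. The names `box_iou` and `pearson` are hypothetical and are not taken from the paper's code.

```python
from math import sqrt
from typing import Sequence, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Toy, made-up numbers: one entry per example.
task_success = [1.0, 0.0, 1.0, 0.0]      # 1 if the task was solved
grounding_score = [0.9, 0.1, 0.8, 0.2]   # e.g. mean IoU over the phrases
correlation = pearson(task_success, grounding_score)
```

Under these assumptions, a correlation near 1 would mean that examples the model solves are also the ones where it grounds phrases well, while a value near 0 would reflect the kind of inconsistency the abstract describes.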

Paper Structure (41 sections, 3 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: We jointly study the task performance and phrase grounding of vision and language models. Above, we illustrate our approach in one of our benchmarks: Touchdown SDR. Top Left: The task is to locate the hidden Touchdown in an urban panorama using a text description. Top Right: The Touchdown SDR dataset is expanded with 167k manual annotations of bounding boxes to support the study. Bottom: ViLT simultaneously locates Touchdown and demonstrates reasoning processes through phrase grounding.
  • Figure 2: Illustrations of reference games with phrase grounding in the KiloGram (left) and Flickr30k Entities (right) benchmarks. Given the input text (framed at the bottom) and the context (a single column of images), the task is to select the referenced image. We replicate each context multiple times to illustrate phrase grounding for multiple phrases from the input text (depicted below each column, except the left column in each example). Red bounding boxes show the ground truth for both the task (left column) and phrase grounding (remaining columns). Blue bounding boxes show model predictions: MDETR for KiloGram and ViLT for Flickr30k Entities. In addition, green masks show pixel-wise segmentation predictions made by ViLT for Flickr30k Entities.
  • Figure 3: Fine-tuning ViLT-Aligner with varying amounts of dataset-specific phrase grounding annotations. In the figure, the $x$-axis indicates the proportion of phrases annotated with bounding boxes, while the $y$-axes represent the metrics for phrase grounding, task-grounding correlation, and task performance.
  • Figure 4: End-task success and failure illustration of a system that achieves strong phrase grounding performance and high task-grounding correlation in Touchdown SDR. We illustrate the outputs of ViLT-Aligner (overlaid in green); this model is initialized with phrase grounding pre-training and fine-tuned with dataset-specific phrase grounding annotations. The illustration is over examples not seen during training.
  • Figure 5: Annotated examples in Touchdown SDR.
  • ...and 7 more figures