Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Navid Rajabi, Jana Kosecka

TL;DR

This work proposes an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause.

Abstract

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed that these models lack fine-grained understanding, such as the ability to count and recognize verbs, attributes, or relationships. The focus of this work is to study the understanding of spatial relations. This has been tackled previously using image-text matching (e.g., Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the poor object localization "grounding" ability of the models is a contributing factor to the poor image-text matching performance. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.

Paper Structure

This paper contains 14 sections, 2 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Although the ground-truth caption for this image is "The bed is next to the bicycle", several other spatial clauses could describe the same scene equally well, such as behind, left of, near, close to, or touching. This intrinsic ambiguity on the language side makes the spatial reasoning task harder to formulate.
  • Figure 2: Our approach consists of two main modules. (1) The Grounding Module predicts the locations of objects along with their confidences, and an MLP takes the bounding box coordinates and predicts a distribution over spatial relationships; these are combined to compute the initial ranking of spatial clauses. (2) The Re-ranking Module adjusts the ranking given the co-occurrence priors. In this example, re-ranking moves "inside" to the 1st rank, although the initial top-3 predictions were already semantically correct.
  • Figure 3: LXMERT Fine-tuning and Zero-shot Performance Reproduction Experiments Design
  • Figure 4: Training curves for LXMERT Binary Classification Head Fine-tuning on VSR Train Set
  • Figure 5: LXMERT Relevancy Scores: The first row shows an example in which both the subject and object attentions imply successful grounding. The second row shows relevant activations for the subject (potted plant) but irrelevant attention weights for the object (bus). The third and fourth rows show irrelevant attention weights for both subjects and objects, demonstrating that LXMERT's fine-grained grounding is inconsistent even when it predicts the binary labels correctly.
  • ...and 2 more figures
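
The two-stage pipeline described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The multiplicative combination of grounding confidences with relation probabilities, the multiplicative use of the co-occurrence prior, and all names and numbers (`initial_ranking`, `rerank`, the logits, the prior values) are assumptions made for illustration only. The relation logits stand in for the output of an MLP over bounding box coordinates.

```python
import math


def softmax(logits):
    """Turn raw relation logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def initial_ranking(subj_conf, obj_conf, relation_logits, relations):
    """Score each spatial clause from grounding and relation evidence.

    Assumption: the evidence is combined as a simple product of the
    subject confidence, the object confidence, and the relation
    probability. The paper's actual combination rule may differ.
    """
    probs = softmax(relation_logits)
    return {rel: subj_conf * obj_conf * p for rel, p in zip(relations, probs)}


def rerank(scores, cooccurrence_prior):
    """Reweight clause scores by a co-occurrence prior and renormalize.

    Assumption: the prior is a per-relation weight for the given
    (subject, object) pair, applied multiplicatively.
    """
    weighted = {rel: s * cooccurrence_prior.get(rel, 1.0)
                for rel, s in scores.items()}
    total = sum(weighted.values())
    ranked = [(rel, w / total) for rel, w in weighted.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical inputs: detector confidences for the two noun phrases,
# relation logits from a (not shown) MLP over the two bounding boxes,
# and a made-up co-occurrence prior for this subject/object pair.
relations = ["in", "on", "inside", "near"]
scores = initial_ranking(0.9, 0.8, [2.0, 1.5, 1.8, 0.5], relations)
prior = {"in": 0.2, "on": 0.1, "inside": 0.6, "near": 0.1}

for relation, prob in rerank(scores, prior):
    print(f"{relation}: {prob:.3f}")
```

With these made-up numbers, "in" has the highest initial score and "inside" is second; after the prior is applied, "inside" moves to the top, which mirrors the kind of adjustment the caption attributes to the Re-ranking Module.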