Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Navid Rajabi; Jana Kosecka

TL;DR

This work proposes an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause.

Abstract

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, have shown that these models lack fine-grained understanding, such as the ability to count and to recognize verbs, attributes, or relationships. The focus of this work is the understanding of spatial relations. This has been tackled previously using image-text matching (e.g., the Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), with both showing poor performance and a large gap compared to human performance. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the models' poor object localization ("grounding") ability is a contributing factor to the poor image-text matching performance. We propose an alternative fine-grained, compositional approach for recognizing and ranking spatial clauses that combines the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative VLMs (such as LXMERT, GPV, and MDETR) and compare and highlight their abilities to reason about spatial relationships.

Paper Structure (14 sections, 2 equations, 7 figures, 6 tables)

Introduction
Probing LXMERT on VSR
Approach
Experiments
Related Works
Conclusions and Future Works
Implementations Details
LXMERT Zero-shot vs. Fine-Tuning Settings
Learning Curves of Fine-tuning LXMERT on VSR (CLS-Head Only)
Full LXMERT Quantitative Results
LXMERT Explainability Results Visualization
Spatial Relationship Groupings
Confusion Matrix of Spatial Relations Classifier
GPV Localization Confidences (Multimodal Relevance/Objectness Scores)

Figures (7)

Figure 1: Although the ground-truth caption for this image is "The bed is next to the bicycle", several other spatial relations can be inferred from the image and would fill in the spatial clause correctly, such as behind, left of, near, close to, and touching. This intrinsic ambiguity on the language side makes the spatial reasoning task more challenging to formulate.
Figure 2: Our approach consists of two main modules. (1) The Grounding Module predicts the locations of objects along with their confidences, and an MLP takes the bounding box coordinates and predicts the distribution over spatial relationships. These are then combined to compute the initial ranking of spatial clauses. (2) The Re-ranking Module adjusts the ranking given the co-occurrence priors. This example shows the effectiveness of the Re-ranking Module in adjusting the spatial clause distribution: it brings $inside$ to the 1st rank, although the initial top-3 predictions were already semantically correct.
Figure 3: LXMERT Fine-tuning and Zero-shot Performance Reproduction Experiments Design
Figure 4: Training curves for LXMERT Binary Classification Head Fine-tuning on VSR Train Set
Figure 5: LXMERT Relevancy Scores: The first row shows an example in which the attention for both the subject and the object implies successful grounding. The second row shows relevant activations for the subject (potted plant) but irrelevant attention weights for the object (bus). The third and fourth rows depict irrelevant attention weights for both subjects and objects, demonstrating that LXMERT's fine-grained grounding is inconsistent even when it predicts the binary labels correctly.
...and 2 more figures
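
The two-stage pipeline described in the Figure 2 caption can be illustrated with a minimal Python sketch. The combination rule is an assumption, since the text here does not state it: the initial score of a relation is taken as the product of the two grounding confidences and the relation probability, and re-ranking multiplies that score by a co-occurrence prior and renormalizes. The function names, the `eps` fallback, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Ranking = List[Tuple[str, float]]


def rank_spatial_clauses(
    subj_conf: float, obj_conf: float, relation_probs: Dict[str, float]
) -> Ranking:
    """Initial ranking (assumed rule): weight each relation probability
    by the grounding confidences of the subject and the object."""
    scores = {rel: subj_conf * obj_conf * p for rel, p in relation_probs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rerank_with_prior(
    ranked: Ranking, prior: Dict[str, float], eps: float = 1e-6
) -> Ranking:
    """Re-ranking (assumed rule): multiply each score by a co-occurrence
    prior for this subject/object pair, then renormalize to sum to 1."""
    rescored = {rel: score * prior.get(rel, eps) for rel, score in ranked}
    total = sum(rescored.values())
    if total == 0:  # e.g. a zero grounding confidence; keep the input order
        return ranked
    normalized = {rel: score / total for rel, score in rescored.items()}
    return sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical values: grounding confidences for subject and object,
# and a relation distribution such as an MLP over box coordinates might give.
relation_probs = {"near": 0.4, "inside": 0.3, "touching": 0.2, "behind": 0.1}
initial = rank_spatial_clauses(0.9, 0.8, relation_probs)

# Hypothetical co-occurrence prior for the same subject/object pair.
prior = {"inside": 0.6, "near": 0.2, "touching": 0.15, "behind": 0.05}
reranked = rerank_with_prior(initial, prior)

print(initial[0][0], reranked[0][0])
```

With these illustrative numbers, the prior moves "inside" ahead of "near", mirroring the kind of adjustment the caption describes. A multiplicative prior is only one possible design; an additive or learned combination would fit the same description.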
