A Surprising Failure? Multimodal LLMs and the NLVR Challenge

Anne Wu; Kianté Brantley; Yoav Artzi

TL;DR

This paper probes whether current multimodal LLMs can robustly handle NLVR, a benchmark crafted to require precise compositional and spatial reasoning and to resist semantic biases. It evaluates GPT-4V, Gemini Pro, and IDEFICS on the NLVR Test-P split using zero- and five-shot prompts, and additionally evaluates an IDEFICS variant finetuned with 4-bit QLoRA. All three models fall well short of human performance: zero-shot GPT-4V scores best of the three (~60%), IDEFICS comes close only when finetuned (~60%), and Gemini Pro shows bias tendencies and mixed gains. The work highlights that neither prompting nor modest finetuning solves NLVR's core challenges, indicating a need for more fundamental advances in modeling spatial-compositional reasoning in multimodal systems.

Abstract

This study evaluates three state-of-the-art MLLMs -- GPT-4V, Gemini Pro, and the open-source model IDEFICS -- on the compositional natural language vision reasoning task NLVR. Given a human-written sentence paired with a synthetic image, this task requires the model to determine the truth value of the sentence with respect to the image. Despite the strong performance demonstrated by these models, we observe they perform poorly on NLVR, which was constructed to require compositional and spatial reasoning, and to be robust to semantic and systematic biases.
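Because each NLVR example reduces to a binary True/False judgment, scoring a model comes down to mapping its free-form response to a truth value and comparing it with the gold label. The following is a minimal sketch of that scoring step, not the authors' evaluation code: the function names, the substring-based parsing rule, and the tuple layout are all assumptions made for illustration. Grouping by image type mirrors the Tower/Scatter split reported in the paper's figures.

```python
from collections import defaultdict
from typing import Optional


def parse_truth_value(response: str) -> Optional[bool]:
    """Map a free-form model response to True/False, or None if unclear.

    Hypothetical parsing rule: naive substring matching, so a word such as
    "untrue" would be misread as "true". A real harness would need stricter rules.
    """
    text = response.strip().lower()
    has_true = "true" in text
    has_false = "false" in text
    if has_true == has_false:  # neither or both present: ambiguous
        return None
    return has_true


def accuracy_by_group(examples):
    """Compute accuracy per group.

    examples: iterable of (group, gold_label, model_response) tuples,
    where group is e.g. an image type and gold_label is a bool.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, response in examples:
        total[group] += 1
        # An unparseable response (None) never equals a bool gold label,
        # so it is counted as incorrect.
        if parse_truth_value(response) == gold:
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}
```

Counting ambiguous responses as errors is one possible design choice; an evaluation could instead re-prompt the model or report such cases separately.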

Paper Structure (13 sections, 2 figures, 4 tables)

This paper contains 13 sections, 2 figures, 4 tables.

Introduction
Task Background: NLVR
Experimental Setup
Models
GPT-4 Turbo with Vision (GPT-4V)
Gemini Pro
IDEFICS
Model Evaluation Details
Prompts Selection
Results & Analysis
Conclusion
Data Statistics
Selected Prompts

Figures (2)

Figure 1: Examples of sentence-image pairs from the NLVR corpus. The left sentence has a truth value of True with respect to the Tower image. The right sentence has a truth value of False with respect to the Scatter image.
Figure 2: Test-P accuracies with zero- and five-shot prompting, split by image type (Tower or Scatter).
