A Surprising Failure? Multimodal LLMs and the NLVR Challenge

Anne Wu, Kianté Brantley, Yoav Artzi

TL;DR

This paper probes whether current multimodal LLMs can robustly handle NLVR, a benchmark designed to require precise compositional and spatial reasoning and to resist semantic biases. It evaluates GPT-4V, Gemini Pro, and IDEFICS on the NLVR Test-P split with zero- and five-shot prompts, and also evaluates an IDEFICS variant finetuned with 4-bit QLoRA. All models fall well short of human performance: zero-shot GPT-4V is the best of the three prompted models (~60%), finetuned IDEFICS reaches a similar level (~60%), and Gemini Pro shows bias tendencies and mixed gains. Neither prompting nor modest finetuning solves NLVR’s core challenges, which points to a need for more fundamental advances in spatial-compositional reasoning in multimodal systems.

Abstract

This study evaluates three state-of-the-art MLLMs -- GPT-4V, Gemini Pro, and the open-source model IDEFICS -- on the compositional natural language vision reasoning task NLVR. Given a human-written sentence paired with a synthetic image, this task requires the model to determine the truth value of the sentence with respect to the image. Despite the strong performance demonstrated by these models, we observe they perform poorly on NLVR, which was constructed to require compositional and spatial reasoning, and to be robust to semantic and systematic biases.

Paper Structure (13 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Examples of sentence-image pairs from the NLVR corpus. The left sentence has a truth value of True with respect to the Tower image. The right sentence has a truth value of False with respect to the Scatter image.
  • Figure 2: Test-P accuracies with zero- and five-shot prompting, split by image type (Tower or Scatter).