A Surprising Failure? Multimodal LLMs and the NLVR Challenge
Anne Wu, Kianté Brantley, Yoav Artzi
TL;DR
This paper probes whether current multimodal LLMs (MLLMs) can robustly handle NLVR, a benchmark built to require precise compositional and spatial reasoning and to resist semantic biases. It evaluates GPT-4V, Gemini Pro, and IDEFICS on the NLVR Test-P split with zero-shot and five-shot prompts, and additionally evaluates an IDEFICS variant finetuned with 4-bit QLoRA. All three models fall well short of human performance: zero-shot GPT-4V is the best of the prompted models (~60%), finetuned IDEFICS reaches a similar level (~60%), and Gemini Pro shows bias tendencies and mixed gains. The work highlights that neither prompting nor modest finetuning solves NLVR's core challenges, which points to a need for more fundamental advances in how multimodal systems model spatial and compositional reasoning.
Abstract
This study evaluates three state-of-the-art MLLMs -- GPT-4V, Gemini Pro, and the open-source model IDEFICS -- on the compositional natural language vision reasoning task NLVR. Given a human-written sentence paired with a synthetic image, this task requires the model to determine the truth value of the sentence with respect to the image. Despite the strong performance these models demonstrate elsewhere, we observe that they perform poorly on NLVR, which was constructed to require compositional and spatial reasoning, and to be robust to semantic and systematic biases.
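
To make the task format concrete, the following is a minimal sketch of how accuracy on a binary true/false task like NLVR could be scored. It is not the authors' evaluation code. The `NLVRExample` fields, the `parse_truth_value` helper, and the `predict` callable (a stand-in for a call to any MLLM) are all hypothetical names introduced for illustration, and the sketch assumes the model is prompted to begin its answer with "True" or "False".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class NLVRExample:
    """One sentence-image pair with its gold truth value (hypothetical schema)."""
    sentence: str
    image_path: str
    label: bool


def parse_truth_value(response: str) -> Optional[bool]:
    """Map a free-text model response to True/False, or None if it is unparseable.

    Assumption: the prompt asks the model to begin its answer with "True" or "False".
    """
    text = response.strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def accuracy(examples: Iterable[NLVRExample],
             predict: Callable[[NLVRExample], str]) -> float:
    """Fraction of examples whose parsed prediction matches the gold label.

    Unparseable responses count as incorrect.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    correct = 0
    for example in examples:
        prediction = parse_truth_value(predict(example))
        if prediction is not None and prediction == example.label:
            correct += 1
    return correct / len(examples)


# Toy data (not from NLVR): a stub "model" that always answers "True".
toy_examples = [
    NLVRExample("There is a yellow block.", "img_0.png", True),
    NLVRExample("There are two black circles.", "img_1.png", False),
    NLVRExample("A box contains a blue triangle.", "img_2.png", True),
    NLVRExample("No box has a square.", "img_3.png", False),
]
always_true_accuracy = accuracy(toy_examples, lambda example: "True")
```

On this balanced toy set, a model that always answers "True" scores 0.5. That constant-answer figure is the baseline against which accuracies on a two-way task are usually read.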
