NL-Eye: Abductive NLI for Images

Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart

TL;DR

NL-Eye introduces a visual abductive reasoning benchmark that pairs a premise image with two hypothesis images; a model must judge which hypothesis is more plausible and explain its choice. Built from 350 carefully curated triplets (1,050 images) spanning six reasoning categories, each annotated with temporal direction and duration, it combines human-authored textual scenes, synthetic image generation, and rigorous validation. Experiments show humans outperform modern VLMs by large margins on both plausibility prediction and explanation quality, indicating a substantial gap in visual-to-logical integration; textual reasoning is feasible, but visual interpretation remains the core bottleneck. The benchmark exposes vulnerabilities in current VLM architectures for real-world safety and verification tasks and provides a reproducible framework with multiple input strategies and evaluation protocols to drive future improvements.

Abstract

Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs' visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps: writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.

Paper Structure (61 sections, 2 equations, 17 figures, 13 tables)

Figures (17)

  • Figure 1: NL-Eye evaluates the abductive reasoning capabilities of VLMs. The main setup involves a premise image and two hypothesis images, where the model is tasked with inferring which hypothesis is more plausible and providing an explanation for its choice.
  • Figure 2: Fully annotated examples from NL-Eye. Each example includes the three images, the textual descriptions (prompts) used to generate them, the gold label, an explanation for why the gold is more plausible, and indications of the reasoning category and temporal direction and duration.
  • Figure 3: Real examples from each reasoning category in NL-Eye. The more plausible hypotheses are framed in green. The gold explanations are included below each sample.
  • Figure 4: NL-Eye data curation workflow scheme. The process includes three steps: (1) writing textual descriptions, (2) generating images, and (3) generating explanation and categorization. Yellow denotes steps that require human involvement while turquoise denotes model-based generations.
  • Figure 5: In the triplet setup (left), the VLM receives a triplet consisting of a premise image and two hypothesis images, and its task is to predict which hypothesis is more plausible and explain why. We provide the triplet twice with the hypotheses in different orders (e.g., see A and B), and we count a prediction as accurate only if the model is consistent and selects the correct hypothesis in both orders. In the pairs setup (right), the input is a premise and a single hypothesis, and the VLM should output a plausibility score. For the same premise and two hypotheses, the VLM's predictions are considered order-faithful and accurate if the correct hypothesis is scored higher than the wrong one.
  • ...and 12 more figures
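
The two scoring rules described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' evaluation code: the record types `TripletPrediction` and `PairPrediction`, their field names, and the function names are all hypothetical. It assumes each triplet prediction has already been mapped from display position back to a hypothesis identifier, and it treats tied scores in the pairs setup as incorrect, since the caption requires the correct hypothesis to be scored higher.

```python
from dataclasses import dataclass


@dataclass
class TripletPrediction:
    """One example in the triplet setup (hypothetical schema)."""
    gold: str          # identifier of the more plausible hypothesis, e.g. "h1"
    pred_order_a: str  # hypothesis chosen when shown (premise, h1, h2)
    pred_order_b: str  # hypothesis chosen when shown (premise, h2, h1)


@dataclass
class PairPrediction:
    """One example in the pairs setup (hypothetical schema)."""
    score_correct: float  # plausibility score for (premise, correct hypothesis)
    score_wrong: float    # plausibility score for (premise, wrong hypothesis)


def triplet_consistency_accuracy(preds):
    """Fraction of examples where both hypothesis orders select the gold hypothesis."""
    if not preds:
        return 0.0
    hits = sum(p.pred_order_a == p.gold and p.pred_order_b == p.gold for p in preds)
    return hits / len(preds)


def pairs_accuracy(preds):
    """Fraction of examples where the correct hypothesis is scored strictly higher."""
    if not preds:
        return 0.0
    hits = sum(p.score_correct > p.score_wrong for p in preds)
    return hits / len(preds)


# Toy data for illustration only; these are not results from the paper.
triplets = [
    TripletPrediction("h1", "h1", "h1"),  # correct in both orders
    TripletPrediction("h1", "h1", "h2"),  # choice flips with order: not counted
    TripletPrediction("h2", "h1", "h1"),  # consistent but wrong
    TripletPrediction("h2", "h2", "h2"),  # correct in both orders
]
pairs = [
    PairPrediction(0.9, 0.2),  # correct hypothesis scored higher
    PairPrediction(0.4, 0.4),  # tie: not counted
    PairPrediction(0.3, 0.7),  # wrong hypothesis scored higher
    PairPrediction(0.8, 0.1),  # correct hypothesis scored higher
]

print(triplet_consistency_accuracy(triplets))  # 0.5
print(pairs_accuracy(pairs))                   # 0.5
```

Requiring agreement across both orders means a model that always picks the same display position scores zero on every example, since the gold hypothesis occupies that position in only one of the two orders.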