VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Minkyu Kim, Sangheon Lee, Dongmin Park

TL;DR

Through extensive evaluation of both proprietary and open-source VLMs, this work reveals systematic gaps between model and human performance across difference types and domains, and provides controlled analyses highlighting where VLMs' reasoning sharply deteriorates.

Abstract

The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action), and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial, and medical imagery. Through extensive evaluation of both proprietary and open-source VLMs, we reveal systematic gaps between model and human performance across difference types and domains, and provide controlled analyses highlighting where VLMs' reasoning sharply deteriorates. Together, our benchmark and findings establish a foundation for advancing VLMs toward human-level comparative reasoning.

Paper Structure (64 sections, 2 equations, 21 figures, 13 tables)

Figures (21)

  • Figure 1: Comparison of VLM-SubtleBench and MLLM-CompBench with GPT-4o.
  • Figure 2: Example tasks from VLM-SubtleBench, covering ten difference categories (Attribute, State, Emotion, Temporal, Spatial, Existence, Quality, Quantity, Viewpoint, Action) and six domains (natural, game, medical, industry, aerial, synthetic). For each example, the correct answer is highlighted in bold green. Model responses from GPT-5-main, Claude-sonnet-4, and Gemini-2.5-pro are shown beneath each question, in that order. Some VQA instances are simplified due to space constraints; full versions and additional examples are provided in the appendix.
  • Figure 3: Data Curation Pipeline of VLM-SubtleBench.
  • Figure 3: Performance of open-source and proprietary vision-language models in VLM-SubtleBench captioning.
  • Figure 4: Statistics of the test split of VLM-SubtleBench.
  • ...and 16 more figures