CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models

Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng

TL;DR

CompareBench introduces a 1,000-question benchmark for visual comparison across quantity, temporal, geometric, and spatial tasks, derived from TallyBench and HistCaps, to diagnose visual comparison reasoning in vision-language models. The study shows that scaling helps but fundamental gaps persist, especially in temporal and spatial reasoning, with counting and geometry remaining error-prone for current systems. It provides a controlled, diagnostic framework with four sub-benchmarks and standardized prompts to isolate perception-driven and knowledge-driven challenges. The findings highlight VLMs' blind spot in comparison reasoning and position CompareBench as a tool to guide development of more robust, transparent multimodal systems.

Abstract

We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
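
As a minimal sketch of how results on the four tasks might be aggregated, the Python below computes per-task accuracy and a size-weighted overall score. The task sizes come from the abstract; the exact-match scoring rule, the record layout, and the function names `per_task_accuracy` and `overall_accuracy` are illustrative assumptions, not the authors' evaluation code.

```python
from collections import defaultdict

# Task sizes as stated in the abstract.
TASK_SIZES = {"quantity": 600, "temporal": 100, "geometric": 200, "spatial": 100}


def per_task_accuracy(records):
    """Return {task: accuracy} for an iterable of (task, predicted, gold) tuples.

    Assumption: an answer counts as correct on an exact match after trimming
    whitespace and ignoring case. The source does not specify the scoring rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        if predicted.strip().upper() == gold.strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(task_acc, sizes=TASK_SIZES):
    """Size-weighted mean accuracy over the tasks present in task_acc."""
    covered = sum(sizes[task] for task in task_acc)
    return sum(task_acc[task] * sizes[task] for task in task_acc) / covered


if __name__ == "__main__":
    # Hypothetical toy predictions, not results from the paper.
    toy = [
        ("quantity", "A", "A"),
        ("quantity", "b", "A"),
        ("temporal", " c ", "C"),
    ]
    acc = per_task_accuracy(toy)
    print(acc)
    print(overall_accuracy(acc))
```

Reporting the per-task numbers alongside the weighted score matters here because quantity questions make up 600 of the 1000 items and would otherwise dominate a single aggregate.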

Paper Structure

This paper contains 10 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of TallyBench, HistCaps, and CompareBench with representative GPT-5 failure cases. CompareBench (bottom) encompasses four fundamental comparison tasks: geometric, spatial, quantity, and temporal sequence reasoning. TallyBench (top-left) is designed for object counting and also forms the basis of the quantity comparison task in CompareBench. HistCaps (top-right), annotated with temporal tags and bilingual captions, serves as the foundation for the temporal sequence comparison task. Although these tasks are trivial for humans, GPT-5 consistently fails on them. For example, it underestimates stacked cups, overestimates the number of chickens, misjudges book thickness, misestimates relative object height, miscounts in comparative settings, and incorrectly orders historical events, highlighting systematic limitations in visual comparison reasoning.
  • Figure 2: Distribution of TallyBench categories. The top level splits into Biology (900) and Artificial Objects (1100). Subcategories include animal, plant, people, food & beverage, electronics, clothing, transportation, ball, household item, etc. The outer ring further specifies around 50 fine-grained classes, such as Dog (40), Cat (60), Chicken (100), Book (100), Spoon (50), and Knife (50).
  • Figure 3: Category distribution of CompareBench. The inner ring represents the four sub-benchmarks: CompareTallyBench (600), CompareTemporalBench (100), CompareGeometryBench (200), and CompareSpatialBench (100). CompareTallyBench inherits diverse categories from TallyBench, including animals, plants, people, food & beverages, electronics, clothing, transportation, household items, etc. The outer ring further decomposes the geometric tasks into five fine-grained types that capture intrinsic object properties, including length, width, height, thickness, and diameter (40 samples each). The spatial tasks are divided into depth (object/point distance to the camera) and vertical height (object/point distance above the ground), with 50 samples each.
  • Figure 4: TallyBench hard cases where all four models fail. Each panel shows the image, the counting question (e.g., "How many spoons are in the image?"), and the predictions from four models (Claude Sonnet 4, Gemini 2.5 Pro, GPT-5, Qwen2.5-VL-72B-Instruct), all of which are incorrect. The six examples (top-left to bottom-right) cover spoons, Labubu instances, sheep/goats, stacked books, birds, and chickens. These cases illustrate typical counting failure modes, including confusing visually similar instances, missing partially occluded objects, and misreading fine-scale duplicates, despite such tasks being trivial for humans.
  • Figure 5: Failure cases from the four CompareBench sub-benchmarks. Each panel shows a sample question and predictions from four state-of-the-art VLMs, all of which are incorrect. From left to right in the first row: (Geometry) identifying the taller book; (Spatial) deciding which marked point is higher above ground; (Tally) comparing the quantity of Labubus; (Temporal) selecting the earliest historical scene. These cases highlight persistent weaknesses in comparative reasoning across dimensions of size, space, quantity, and time.
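
The counts quoted in the captions of Figures 2 and 3 can be cross-checked against the stated dataset totals. The Python below is a small consistency sketch: the numbers are copied from the captions, while the dictionary names, the `vertical_height` key, and the `composition_totals` helper are illustrative and do not come from the paper.

```python
# Counts copied from the Figure 2 and Figure 3 captions.
TALLYBENCH_TOP_LEVEL = {"Biology": 900, "Artificial Objects": 1100}

SUB_BENCHMARKS = {
    "CompareTallyBench": 600,
    "CompareTemporalBench": 100,
    "CompareGeometryBench": 200,
    "CompareSpatialBench": 100,
}

# Five geometric types with 40 samples each.
GEOMETRY_TYPES = {
    name: 40 for name in ("length", "width", "height", "thickness", "diameter")
}

# Two spatial types with 50 samples each.
SPATIAL_TYPES = {"depth": 50, "vertical_height": 50}


def composition_totals():
    """Sum each breakdown so it can be compared with the stated totals."""
    return {
        "tallybench": sum(TALLYBENCH_TOP_LEVEL.values()),
        "comparebench": sum(SUB_BENCHMARKS.values()),
        "geometry": sum(GEOMETRY_TYPES.values()),
        "spatial": sum(SPATIAL_TYPES.values()),
    }


if __name__ == "__main__":
    totals = composition_totals()
    # Fine-grained types should add up to their parent sub-benchmark.
    assert totals["geometry"] == SUB_BENCHMARKS["CompareGeometryBench"]
    assert totals["spatial"] == SUB_BENCHMARKS["CompareSpatialBench"]
    print(totals)
```

Under these figures the breakdowns agree with the abstract: 2000 TallyBench images and 1000 CompareBench questions.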