CompareBench: A Benchmark for Visual Comparison Reasoning in Vision-Language Models
Jie Cai, Kangning Yang, Lan Fu, Jiaming Ding, Jinlong Li, Huiming Sun, Daitao Xing, Jinglin Shen, Zibo Meng
TL;DR
CompareBench introduces a 1,000-question benchmark for visual comparison across quantity, temporal, geometric, and spatial tasks, derived from TallyBench and HistCaps, to diagnose visual comparison reasoning in vision-language models (VLMs). The study shows that scaling helps but fundamental gaps persist, especially in temporal and spatial reasoning, with counting and geometry remaining error-prone for current systems. It provides a controlled, diagnostic framework with four sub-benchmarks and standardized prompts to isolate perception-driven and knowledge-driven challenges. The findings highlight VLMs' blind spot in comparison reasoning and position CompareBench as a tool to guide the development of more robust, transparent multimodal systems.
Abstract
We introduce CompareBench, a benchmark for evaluating visual comparison reasoning in vision-language models (VLMs), a fundamental yet understudied skill. CompareBench consists of 1000 QA pairs across four tasks: quantity (600), temporal (100), geometric (200), and spatial (100). It is derived from two auxiliary datasets that we constructed: TallyBench (2000 counting images with QA) and HistCaps (515 historical images with bilingual captions). We evaluate both closed-source APIs (OpenAI, Gemini, Claude) and open-source models (Qwen2.5-VL and Qwen3-VL series). Results show clear scaling trends but also reveal critical limitations: even the strongest models consistently fail at temporal ordering and spatial relations, and they often make mistakes in basic counting and geometric comparisons that are trivial for humans. These findings demonstrate that visual comparison remains a systematic blind spot for current VLMs. By providing controlled, diverse, and diagnostic evaluation, CompareBench establishes a foundation for advancing more reliable multimodal reasoning.
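Because the four tasks differ in size (600, 200, 100, 100), a single overall score is dominated by the quantity task, so per-task accuracy is the more diagnostic number. The sketch below illustrates that bookkeeping in Python. It is not the authors' evaluation code: the record format (task, predicted answer, gold answer) and the function names are assumptions made for illustration, and only the task sizes come from the abstract.

```python
from collections import defaultdict

# Sub-benchmark sizes as stated in the abstract.
TASK_SIZES = {"quantity": 600, "temporal": 100, "geometric": 200, "spatial": 100}


def per_task_accuracy(records):
    """Return accuracy per task.

    `records` is an iterable of (task, predicted, gold) tuples.
    This format is hypothetical, not taken from the paper.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)
    return {task: correct[task] / total[task] for task in total}


def overall_accuracy(task_acc, sizes=TASK_SIZES):
    """Size-weighted mean accuracy over the tasks present in `task_acc`."""
    weight = sum(sizes[task] for task in task_acc)
    return sum(task_acc[task] * sizes[task] for task in task_acc) / weight
```

Reporting the per-task dictionary alongside the weighted mean keeps a weak temporal or spatial score from being hidden by the larger quantity split.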
