ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

Haonan Han, Jiancheng Huang, Xiaopeng Sun, Junyan He, Rui Yang, Jie Hu, Xiaojiang Peng, Lin Ma, Xiaoming Wei, Xiu Li

Abstract

Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert" where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/
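The abstract's second and third innovations (the dual-track mechanism and the evidence-grounded automated judge) together describe a concrete evaluation protocol: an MLLM judge, conditioned on the ground-truth image, scores the generator's intermediate reasoning and its final output separately. The sketch below shows one plausible shape for that loop; every name in it (the Sample fields, query_mllm, the prompt wording, and the 1-5 scale) is our own hypothetical scaffolding, not ViGoR's actual code or scoring rubric.

from dataclasses import dataclass

@dataclass
class Sample:
    instruction: str   # the editing/reasoning instruction given to the generator
    gt_image: str      # path to the human-verified ground-truth image
    cot_trace: str     # the generator's intermediate chain-of-thought text
    output_image: str  # path to the generator's final image

JUDGE_PROMPT = (
    "You are an evaluator. Conditioned on the ground-truth image, score the "
    "candidate on a 1-5 scale and cite concrete visual evidence.\n"
    "Instruction: {instruction}\nTrack: {track}\n"
)

def query_mllm(prompt: str, images: list) -> dict:
    # Placeholder for a real multimodal-LLM call; stubbed so the sketch runs.
    # A real judge would return something like {"score": 4, "evidence": "..."}.
    return {"score": 0, "evidence": "stub"}

def judge(sample: Sample) -> dict:
    # Track 1: grade the intermediate reasoning process (the CoT trace),
    # grounded in the ground-truth image only.
    process = query_mllm(
        JUDGE_PROMPT.format(instruction=sample.instruction,
                            track="reasoning process") + sample.cot_trace,
        images=[sample.gt_image],
    )
    # Track 2: grade the final output side by side with the ground truth.
    result = query_mllm(
        JUDGE_PROMPT.format(instruction=sample.instruction,
                            track="final output"),
        images=[sample.gt_image, sample.output_image],
    )
    return {
        "process_score": process["score"],
        "result_score": result["score"],
        "evidence": (process["evidence"], result["evidence"]),
    }

Keeping the two tracks as separate judge calls mirrors the benchmark's motivating point: a correct final image can mask a broken reasoning process, and a sound process can still yield a flawed result, so neither score should be inferred from the other.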

Paper Structure

This paper contains 16 sections, 6 equations, 12 figures, and 22 tables.

Figures (12)

  • Figure 1: An overview of ViGoR-Bench. (a) The data distribution across various domains. (b) Examples of the reasoning process from generation models. (c) Performance comparison of leading models on ViGoR-Bench.
  • Figure 2: An overview of the ViGoR-Bench construction and evaluation pipelines. (a) The benchmark dataset is constructed through a three-pronged approach: generative synthesis, real-world acquisition, and algorithmic generation. All data undergoes human review to establish definitive image-ground truth (GT) pairs. (b) For evaluation, a Multimodal Large Language Model (MLLM) is employed as an automated judge. Conditioned on the ground-truth image, the MLLM assesses both the Chain-of-Thought (CoT) reasoning process and the final output of generative models for images and videos.
  • Figure 3: Overview of the ViGoR-Bench task suite. We present representative demo cases and their corresponding editing instructions across 20 distinct sub-tasks. These tasks are hierarchically organized into three primary reasoning domains: Physical Reasoning, Knowledge Reasoning, and Symbolic Reasoning.
  • Figure 4: Qualitative comparison of leading models. We present case studies across three representative reasoning domains.
  • Figure 5: Impact of problem complexity on Reasoning Success. We report the performance of evaluated models on Sudoku, Jigsaw Puzzle, and Maze Navigation tasks across varying grid dimensions.
  • ...and 7 more figures