Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models

Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang

TL;DR

Jigsaw-Puzzles is a real-world, puzzle-inspired benchmark for diagnosing spatial reasoning in vision-language models (VLMs). It comprises 1,100 images and five tasks that progress from spatial perception to multi-step reasoning, built with an automated data-generation pipeline. Across the 24 VLMs evaluated, human participants markedly outperform every model; reasoning-enhanced models show the largest gains, though open-source models lag behind proprietary ones. The work highlights a substantial, persistent gap to human-level spatial cognition and provides datasets and tools to support further progress in grounded spatial reasoning.

Abstract

Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.

Paper Structure

This paper contains 12 sections, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Jigsaw-Puzzles example. While human participants effortlessly reconstruct the original spatial layout, all tested VLMs fail to recover the correct order.
  • Figure 2: Evaluation of VLMs on Jigsaw-Puzzles. The plot reports the accuracy of 8 representative VLMs on 5 tasks.
  • Figure 3: Task examples of Jigsaw-Puzzles. Note: the questions above are slightly simplified for clarity and brevity, and the blue option indicates the correct answer.
  • Figure 4: Dataset curation pipeline. Step 1 filters candidate images through expert-defined rules to build a spatial reasoning dataset. Step 2 uses automated templates to generate task-specific QA pairs from the curated images.
  • Figure 5: Task Similarity Heatmap. The heatmap illustrates the pairwise correlation between tasks in our benchmark, measured using Pearson correlation coefficients. Task names are abbreviated using the initials of each word (e.g., Missing Piece Selection → MPS). The suffixes _e and _h indicate the Easy and Hard settings, respectively.
  • ...and 13 more figures
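
The task-similarity analysis described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that each task is represented by a vector of per-model accuracies and that each heatmap cell is the Pearson coefficient between two such vectors; the caption does not spell out this setup, so treat it as an assumption. The function names `pearson` and `task_similarity`, the key `OTHER_TASK`, and all accuracy values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def task_similarity(scores):
    """Pairwise Pearson matrix, returned as a dict keyed by (task_a, task_b)."""
    tasks = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in tasks for b in tasks}


# Placeholder accuracies (%) for four hypothetical models; NOT from the paper.
# MPS_e / MPS_h follow the caption's abbreviation scheme (Easy / Hard settings).
scores = {
    "MPS_e": [40.0, 55.0, 70.0, 85.0],
    "MPS_h": [30.0, 42.0, 61.0, 80.0],
    "OTHER_TASK": [20.0, 35.0, 25.0, 50.0],  # hypothetical third task
}

matrix = task_similarity(scores)
for (a, b), r in matrix.items():
    print(f"{a:>10} vs {b:<10} r = {r:+.3f}")
```

A high coefficient between two tasks would suggest that models that do well on one tend to do well on the other, which is how a heatmap of this kind is typically read.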