SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin

TL;DR

SIRI-Bench addresses the gap in evaluating Vision-Language Models (VLMs) on spatially grounded, structurally complex reasoning by introducing a large-scale benchmark of 9,078 video-based solid geometry problems rendered in realistic 3D scenes. It pairs geometry problems with a multi-agent Automatic Scene Creation Engine that synthesizes faithful scenes and videos, enabling scalable data generation. Across a broad set of VLMs and human baselines, the study shows that current models struggle both to extract spatial information and to perform multi-step geometric reasoning from video, even when given explicit textual conditions. The work highlights a critical bottleneck in spatially grounded multimodal reasoning and points toward improving VLMs by integrating spatial perception with structured reasoning.

Abstract

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatially grounded reasoning tasks. SIRI-Bench comprises roughly 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will draw researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.

Paper Structure

This paper contains 35 sections and 11 figures.

Figures (11)

  • Figure 1: Spatial Intelligence with Complex Reasoning. Conventional benchmarks focus on text-grounded reasoning (upper-left), where reasoning is limited to text alone. In contrast, this work introduces spatially grounded reasoning (lower-left), where key conditions are implicitly embedded in realistic 3D scenes and presented as videos. Solving these problems requires interleaved textual reasoning and spatial perception (right column), challenging VLMs' spatial reasoning ability.
  • Figure 2: The Transformation Process from an original math problem to a 3D Spatial Representation. The given math problem is decomposed into five components, each processed individually. First, the main entity's dimensions are solved, and corresponding Blender Python (bpy) code generates the 3D scene, which is later rendered as a video. Second, problem conditions are refined by removing information meant to be inferred from the scene, and node indices are replaced with color markers. Finally, the answer is adjusted for scaling effects to produce the final answer.
  • Figure 3: Data Samples in SIRI-Bench. This figure presents several samples from the SIRI-Bench dataset, along with their original questions and intermediate steps. Across these samples, the data generation engine accurately solves for geometric conditions, replaces vertex indices, processes textual conditions, and computes numerical answers, demonstrating its reliability.
  • Figure 4: Performance of Existing VLMs. This figure shows the error distributions across seven intervals ranging from 0% to 200% for all baseline methods on SIRI-Bench. A higher concentration of errors in the lower intervals (i.e., brighter colors) indicates better problem-solving performance. The method labeled 'Textual Rep.' refers to an LLM that accesses the full mathematical conditions through textual descriptions rather than videos of 3D scenes. Overall, the results reveal the limitations of current VLMs in spatially grounded reasoning.
  • Figure 5: Ablation on Problem Representation. This figure compares the accuracy of two sibling models using textual representation versus 3D spatial representation as input. Three columns depict three pairs of sibling LLMs/VLMs. This comparison disentangles structural reasoning from spatial perception, revealing that existing VLMs struggle to effectively extract spatial information when solving complex visual problems.
  • ...and 6 more figures
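
Two steps mentioned in the captions above lend themselves to a small illustration: adjusting an answer for scene scaling (Figure 2) and grouping relative errors into seven intervals between 0% and 200% (Figure 4). The Python sketch below is not the authors' code. The function names, the `SCALE_POWER` table, and the bin edges in `BIN_EDGES` are all assumptions chosen for illustration; the source only states that there are seven intervals spanning 0% to 200%. The scaling rule itself is standard geometry: under a uniform scale factor s, lengths scale by s, areas by s^2, volumes by s^3, and angles are unchanged.

```python
from bisect import bisect_left

# Assumed mapping from answer type to the power of the scale factor.
# Uniform scaling by s multiplies lengths by s, areas by s**2,
# volumes by s**3, and leaves angles unchanged.
SCALE_POWER = {"angle": 0, "length": 1, "area": 2, "volume": 3}

# Hypothetical upper edges of seven relative-error intervals (0% to 200%).
# The actual edges used in the paper are not given in this summary.
BIN_EDGES = [0.05, 0.10, 0.20, 0.50, 1.00, 1.50, 2.00]


def rescale_answer(value, kind, scale):
    """Adjust a numeric answer after the scene is uniformly scaled by `scale`."""
    return value * scale ** SCALE_POWER[kind]


def relative_error(predicted, ground_truth):
    """Absolute error divided by the magnitude of the ground truth."""
    return abs(predicted - ground_truth) / abs(ground_truth)


def error_bin(err):
    """Index of the interval containing `err`; errors above 200% are
    clamped into the last interval (an assumption of this sketch)."""
    return min(bisect_left(BIN_EDGES, err), len(BIN_EDGES) - 1)


def error_histogram(predictions, ground_truths):
    """Count how many predictions fall into each of the seven intervals."""
    counts = [0] * len(BIN_EDGES)
    for pred, truth in zip(predictions, ground_truths):
        counts[error_bin(relative_error(pred, truth))] += 1
    return counts


if __name__ == "__main__":
    # A volume of 2.0 in the original problem, with the scene enlarged 3x.
    print(rescale_answer(2.0, "volume", 3.0))
    # Three predictions against a ground truth of 10.0.
    print(error_histogram([10.3, 13.0, 40.0], [10.0, 10.0, 10.0]))
```

Under these assumed edges, a model whose counts concentrate in the first few entries of the histogram corresponds to the "brighter colors" described in the Figure 4 caption.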