SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin
TL;DR
SIRI-Bench addresses the lack of benchmarks that evaluate vision-language models (VLMs) on spatially grounded, structurally complex reasoning by introducing a large-scale benchmark of 9,078 video-based solid geometry problems rendered in realistic 3D scenes. A multi-agent Automatic Scene Creation Engine translates abstract geometry problems into faithful 3D scenes and videos, enabling scalable data generation. Across a broad set of VLMs and human baselines, the study shows that current models struggle to extract spatial information from video and to carry out multi-step geometric reasoning, even when given explicit textual conditions. The work highlights a critical bottleneck in spatially grounded multimodal reasoning and points toward improving VLMs through integrated spatial perception and structured reasoning.
Abstract
Large Language Models (LLMs) have progressed rapidly, largely owing to reinforcement learning on complex reasoning tasks. In contrast, although spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatially grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will draw researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
