SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding

Hongpei Zheng, Shijie Li, Yanran Li, Hujun Yin

TL;DR

The paper tackles large-scale 3D scene understanding by introducing H$^2$U3D, a house-scale VQA dataset with multi-floor environments and chain-of-thought annotations, and SpatialReasoner, an active perception framework that interactively explores scenes via spatial tools. It adopts a two-stage training pipeline (a supervised cold start followed by GRPO-based reinforcement learning with an adaptive exploration reward) to encourage efficient yet thorough exploration. On H$^2$U3D, SpatialReasoner reaches state-of-the-art performance, outperforming strong baselines such as GPT-4o and Gemini-2.5-Pro while using only 3-4 images on average versus 16+ for the baselines, which demonstrates the value of coarse-to-fine active exploration. The work advances spatial reasoning for embodied AI by enabling scalable, interpretable reasoning over large 3D spaces and providing a foundation for real-world house-scale perception tasks.

Abstract

Spatial reasoning in large-scale 3D environments remains challenging for current vision-language models, which are typically constrained to room-scale scenarios. We introduce H$^2$U3D (Holistic House Understanding in 3D), a 3D visual question answering dataset designed for house-scale scene understanding. H$^2$U3D features multi-floor environments spanning up to three floors and 10-20 rooms, covering more than 300 m$^2$. Through an automated annotation pipeline, it constructs hierarchical coarse-to-fine visual representations and generates diverse question-answer pairs with chain-of-thought annotations. We further propose SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries. SpatialReasoner is trained through a two-stage strategy: a supervised cold start followed by reinforcement learning with an adaptive exploration reward that promotes efficient exploration while discouraging redundant operations. Extensive experiments demonstrate that SpatialReasoner achieves state-of-the-art performance on H$^2$U3D, outperforming strong baselines including GPT-4o and Gemini-2.5-Pro. Notably, our method attains superior results while using only 3-4 images in total on average, compared to baselines requiring 16+ images, highlighting the effectiveness of our coarse-to-fine active exploration paradigm.
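
The reinforcement-learning stage can be illustrated with a minimal sketch. This section does not give the paper's actual reward, so `toy_reward` and its `redundancy_penalty` parameter are hypothetical stand-ins for an exploration-aware reward. The advantage computation follows the standard group-relative normalization used in GRPO (each rollout's reward is standardized against the other rollouts sampled for the same question); the use of the population standard deviation here is an assumption.

```python
from statistics import mean, pstdev


def toy_reward(correct: bool, redundant_calls: int,
               redundancy_penalty: float = 0.1) -> float:
    """Hypothetical reward, NOT the paper's formula.

    Gives 1.0 for a correct answer and subtracts a fixed penalty
    for every tool call judged redundant.
    """
    return (1.0 if correct else 0.0) - redundancy_penalty * redundant_calls


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for the same question: (answer correct?, redundant calls)
    rollouts = [(True, 0), (True, 3), (False, 0), (False, 2)]
    rewards = [toy_reward(c, k) for c, k in rollouts]
    for rollout, adv in zip(rollouts, group_relative_advantages(rewards)):
        print(rollout, round(adv, 3))
```

Under this toy reward, a correct rollout with no redundant calls receives the largest advantage in its group, while a correct rollout padded with redundant calls is ranked lower, which is the qualitative behavior the abstract attributes to the adaptive exploration reward.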

Paper Structure

This paper contains 22 sections, 11 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Two-stage H$^2$U3D dataset construction pipeline. Stage 1 (QA Collection): Multi-floor BEVs and random tool calls are processed by Gemini-2.5-Pro to generate question-answer pairs. Stage 2 (CoT Generation): The same visual inputs and tool sequences are used to produce detailed chain-of-thought reasoning annotations.
  • Figure 2: SpatialReasoner inference process for 3D visual question answering. The model performs hierarchical exploration from top-down view to focused regions, then renders first-person views to locate target objects and answer questions about spatial scenes.
  • Figure 3: H$^2$U3D spatial scale compared with ScanQA, and question type analysis.