SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning

Yi Zhang, Youya Xia, Yong Wang, Meng Song, Xin Wu, Wenjun Wan, Bingbing Liu, AiXue Ye, Hongbo Zhang, Feng Wen

TL;DR

SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism, significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.

Abstract

While Multimodal Large Language Models (MLLMs) excel in semantic tasks, they frequently lack the "spatial sense" essential for sophisticated geometric reasoning. Current models typically suffer from exorbitant modality-alignment costs and insufficient precision in fine-grained structural modeling. We introduce SSR, a framework designed for Structured Scene Reasoning that seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. To minimize training overhead, our framework anchors 3D geometric features to the large language model's pre-aligned 2D visual semantics through cross-modal addition and token interleaving, effectively obviating the need for large-scale alignment pre-training. To underpin complex spatial reasoning, we propose a novel scene graph generation pipeline that represents global layouts as a chain of independent local triplets defined by relative coordinates. This is complemented by an incremental generation algorithm, enabling the model to construct "language-model-friendly" structural scaffolds for complex environments. Furthermore, we extend these capabilities to the global-scale 3D grounding task, achieving absolute metric precision across heterogeneous data sources. At a 7B parameter scale, SSR achieves state-of-the-art performance on multiple spatial intelligence benchmarks, notably scoring 73.9 on VSI-Bench. Our approach significantly outperforms much larger models, demonstrating that efficient feature alignment and structured scene reasoning are the cornerstones of authentic spatial intelligence.
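As a rough illustration of the two fusion operations named in the abstract, the sketch below adds a per-token geometric feature to each 2D visual token and then interleaves spatial tokens with visual tokens frame by frame. The function names (`add_features`, `interleave_frames`), the plain-list token representation, and the spatial-before-visual ordering are assumptions made for this example; the abstract does not specify these details, so this is not a description of SSR's actual implementation.

```python
from typing import List

Vector = List[float]


def add_features(visual: List[Vector], geometric: List[Vector]) -> List[Vector]:
    """Cross-modal addition (assumed form): element-wise sum of each 2D
    visual token with the geometric feature at the same position."""
    if len(visual) != len(geometric):
        raise ValueError("expected one geometric feature per visual token")
    fused = []
    for v, g in zip(visual, geometric):
        if len(v) != len(g):
            raise ValueError("feature dimensions must match")
        fused.append([a + b for a, b in zip(v, g)])
    return fused


def interleave_frames(
    spatial_frames: List[List[Vector]],
    visual_frames: List[List[Vector]],
) -> List[Vector]:
    """Token interleaving (assumed form): for every frame, emit its spatial
    tokens followed by its visual tokens, producing one flat sequence."""
    if len(spatial_frames) != len(visual_frames):
        raise ValueError("expected the same number of frames in both branches")
    sequence: List[Vector] = []
    for spatial, visual in zip(spatial_frames, visual_frames):
        sequence.extend(spatial)
        sequence.extend(visual)
    return sequence


# Toy usage with 2-dimensional tokens and two frames.
frame0 = add_features([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [1.0, -1.0]])
frame1 = [[9.0, 9.0]]
tokens = interleave_frames([[[0.0, 0.0]], [[7.0, 7.0]]], [frame0, frame1])
```

Because the addition leaves the token count and dimension unchanged, the fused tokens stay in the embedding space the language model already accepts, which is one plausible reading of why no large-scale alignment pre-training would be needed.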

Paper Structure (38 sections, 4 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 38 sections, 4 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of model performance on VSI-Bench. SSR achieves the highest accuracy among all proprietary and open-source competitors. Notably, our 7B model outperforms significantly larger models, demonstrating superior parameter efficiency in spatial reasoning.
  • Figure 2: Architecture of SSR-3D: It adopts a dual-branch architecture to jointly leverage 2D visual and 3D spatial cues. The 3D branch encodes geometric scene structure through dedicated spatial tokens, while the 2D branch processes image-derived visual features extracted by a vision encoder. Tokens from both branches are then interleaved and fused as input to the LLM, enabling unified multimodal reasoning over appearance and geometry.
  • Figure 3: Global scene graph representation via LocalCogMap. Left: Global Scene Graph: Our proposed framework maintains global connectivity while redefining triplets as localized spatial units. Right: LocalCogMap Construction: Each triplet is modeled within a $10 \times 10$ grid established by two anchors. The target object is then normalized within this frame. This formulation ensures geometric consistency across the entire scene graph.
  • Figure 4: MultiQA-based scene graph generation. We transform global scene graphs into independent triplets. For each triplet, the LLM infers target coordinates relative to two anchors within a structured system context. Compared to dense captions, this decoupled QA format ensures scalability to complex scenes and reduces computational redundancy.
  • Figure 5: Visualization of the 7-DoF coordinate generation algorithm. As illustrated in the generation pipeline, we define the origin of the global coordinate system as the camera position in the first frame, and align the positive direction of the X-axis with the projection of the optical axis onto the ground plane. The visualizations of the 7-DoF grounding results demonstrate that our proposed coordinate definition is both geometrically clear and highly adaptable across diverse scenarios and datasets.
  • ...and 4 more figures
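
The coordinate convention in the Figure 5 caption can be sketched as follows. The helper names, the choice of Z as the up axis, and the right-handed completion of the Y-axis are assumptions for illustration; the caption only fixes the origin (first-frame camera position) and the X-axis direction (optical axis projected onto the ground plane).

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _normalize(v: Vec3) -> Vec3:
    norm = math.sqrt(_dot(v, v))
    if norm < 1e-9:
        raise ValueError("cannot normalize a near-zero vector")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def build_global_frame(
    optical_axis: Vec3, up: Vec3 = (0.0, 0.0, 1.0)
) -> Tuple[Vec3, Vec3, Vec3]:
    """Return (x_axis, y_axis, z_axis) of the global frame.

    X is the optical axis with its vertical component removed, i.e. its
    projection onto the ground plane. Z is the up direction (assumption),
    and Y = Z x X completes a right-handed frame (assumption).
    """
    z_axis = _normalize(up)
    vertical = _dot(optical_axis, z_axis)
    projected = (
        optical_axis[0] - vertical * z_axis[0],
        optical_axis[1] - vertical * z_axis[1],
        optical_axis[2] - vertical * z_axis[2],
    )
    # Fails if the camera looks straight up or down, where X is undefined.
    x_axis = _normalize(projected)
    y_axis = _cross(z_axis, x_axis)
    return x_axis, y_axis, z_axis


def to_global(point: Vec3, camera_position: Vec3, frame: Tuple[Vec3, Vec3, Vec3]) -> Vec3:
    """Express a point in the global frame whose origin is the first-frame camera."""
    rel = (
        point[0] - camera_position[0],
        point[1] - camera_position[1],
        point[2] - camera_position[2],
    )
    return (_dot(rel, frame[0]), _dot(rel, frame[1]), _dot(rel, frame[2]))


# Camera at (1, 2, 0.5), looking along +Y and tilted downward.
frame = build_global_frame((0.0, 1.0, -1.0))
ahead = to_global((1.0, 5.0, 0.5), (1.0, 2.0, 0.5), frame)
```

In this toy setup a point three meters in front of the camera at the same height maps to roughly `(3, 0, 0)`, regardless of the camera's downward tilt, because only the ground-plane projection of the optical axis defines the X-axis.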