S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance

Beining Xu, Siting Zhu, Zhao Jin, Junxian Li, Hesheng Wang

TL;DR

The paper tackles 3D Visual Grounding by enabling MLLMs to perform implicit 3D spatial reasoning through a Spatial Guidance Strategy that leverages feed-forward 3D reconstruction during training, paired with a Structure-Enhanced module that uses intra-view/inter-view attention and multi-level position encoding. The model learns structure-aware representations without requiring point-cloud reconstruction at inference, and the paper reports strong accuracy, efficiency, and generalization across ScanRefer, Nr3D, Sr3D, and out-of-domain datasets. Key contributions are reconstruction-guided implicit 3D understanding, a dual-view attention scheme, and explicit 3D position cues, all integrated into a unified MLLM framework that is reported to outperform existing methods while remaining efficient. The work is relevant to embodied AI and robotics, where it offers 3D grounding with reduced computational overhead and better cross-view consistency.

Abstract

3D Visual Grounding (3DVG) focuses on locating objects in 3D scenes based on natural language descriptions, serving as a fundamental task for embodied AI and robotics. Recent advances in Multi-modal Large Language Models (MLLMs) have motivated research into extending them to 3DVG. However, MLLMs primarily process 2D visual inputs and struggle to understand the 3D spatial structure of scenes solely from these limited perspectives. Existing methods mainly rely on viewpoint-dependent rendering of reconstructed point clouds to provide explicit structural guidance for MLLMs in 3DVG tasks, leading to inefficiency and limited spatial reasoning. To address this issue, we propose S$^2$-MLLM, an efficient framework that equips MLLMs with implicit spatial reasoning. We introduce a spatial guidance strategy that leverages the structure awareness of feed-forward 3D reconstruction. By acquiring 3D structural understanding during training, our model can implicitly reason about 3D scenes without relying on inefficient point cloud reconstruction. Moreover, we propose a structure-enhanced module (SE), which first employs intra-view and inter-view attention mechanisms to capture dependencies within views and correspondences across views. The module further integrates multi-level position encoding to associate visual representations with spatial positions and viewpoint information, enabling more accurate structural understanding. Extensive experiments demonstrate that S$^2$-MLLM combines superior performance, generalization, and efficiency, significantly outperforming existing methods across the ScanRefer, Nr3D, and Sr3D datasets. Code will be available upon acceptance.
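
To make the distinction between intra-view and inter-view attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention without learned projections, a feature tensor of shape (views, patches, dim), and an inter-view step that lets every token attend to all tokens of all views. The function names and the additive `pos_enc` term are hypothetical choices for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Scaled dot-product self-attention over the second-to-last axis.
    # tokens: (..., N, D). No learned Q/K/V projections (simplification).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def intra_view_attention(feats):
    # feats: (V, P, D). Each view is a separate batch entry, so tokens
    # only attend to other tokens of the same view.
    return self_attention(feats)

def inter_view_attention(feats):
    # Assumption: flatten all views into one sequence so that every token
    # can attend to tokens from every view.
    V, P, D = feats.shape
    flat = feats.reshape(1, V * P, D)
    return self_attention(flat).reshape(V, P, D)

def structure_enhanced_sketch(feats, pos_enc):
    # Hypothetical composition: add position cues, then apply the two
    # attention steps with residual connections.
    x = feats + pos_enc
    x = x + intra_view_attention(x)
    x = x + inter_view_attention(x)
    return x
```

In this sketch, perturbing one view leaves the intra-view output of the other views unchanged but alters their inter-view output, which is the property the two attention types are meant to separate.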

Paper Structure

This paper contains 27 sections, 9 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Comparison of previous methods and our method. (a) Previous methods typically reconstruct point clouds of 3D scenes explicitly and then render 2D images to obtain structure guidance. (b) Our method leverages spatial guidance to understand the 3D structure during training, allowing the model to perform implicit spatial reasoning in the latent space without requiring point-cloud reconstruction at inference.
  • Figure 2: The framework of S$^2$-MLLM. Our model takes sampled multi-view RGB-D frames, related camera parameters, language descriptions, and the bounding boxes of object proposals as inputs. The shared video encoder and position encoder extract visual and geometric features from RGB-D frames. The structure-enhanced module (SE) integrates visual features and position information to form the visual input for the LLM. The Video LLM jointly processes the visual input and the tokenized query, enabling cross-modal understanding and reasoning. The grounding head predicts the 3D bounding box (Bbox) of the target object, and the language head generates the category of the target object, while the reconstruction decoder predicts the point map to provide reconstruction supervision. Note that reconstruction is used only at training time.
  • Figure 3: Qualitative comparison of 3DVG results in challenging spatial understanding cases. Incorrect predictions are highlighted in magenta, correct ones in cyan, and key spatial relations are underlined. Specifically, S$^2$-MLLM (a) reasons with scene layout priors; (b) understands the specified viewpoint; (c) predicts through structure cues despite partial occlusion; (d) distinguishes similar objects by jointly reasoning over multiple spatial relationships; (e) accurately handles spatial relations involving multiple reference objects; and (f) understands relative positioning. Overall, S$^2$-MLLM demonstrates more reliable spatial understanding than the previous method.
  • Figure 4: Error type analysis on the ScanRefer dataset (Chen et al., 2020).
  • Figure 5: Typical types of failure cases in ScanRefer (Chen et al., 2020). Ground truth is highlighted in green, our predictions in red. Key semantic information is shaded.
  • ...and 4 more figures
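
The Figure 2 caption states that the model receives RGB-D frames and camera parameters and that a position encoder extracts geometric features from them. This page does not describe how that encoder works, so the sketch below only illustrates one standard way to obtain per-pixel 3D positions from such inputs: pinhole back-projection of a depth map into world coordinates. The function name `backproject_depth`, the intrinsics layout, and the camera-to-world pose convention are assumptions, not details taken from the paper.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    # depth: (H, W) metric depth along the camera z-axis.
    # K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) pose (assumed convention).
    H, W = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (H, W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    # Homogeneous camera-frame points, shape (H, W, 4).
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    # Row-vector form of (cam_to_world @ p) for every pixel.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[..., :3]
```

The resulting (H, W, 3) map of world coordinates is the kind of signal that could be pooled per patch and turned into a position encoding; how S$^2$-MLLM actually does this is not specified here.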