Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo

TL;DR

The paper identifies a fundamental mismatch between standard causal decoders and the order-agnostic nature of 3D scenes in multi-modal reasoning. It introduces 3D-SLIM, a plug-in, parameter-free masking strategy that replaces causal attention with a Geometry-adaptive Mask and an Instruction-aware Mask to enforce spatially grounded and instruction-guided interactions. Across multiple benchmarks and diverse LLM backbones, 3D-SLIM consistently improves object-centric 3D scene-language tasks, highlighting the importance of decoder design for spatial reasoning. The work demonstrates broad applicability and sets a foundation for more capable 3D multi-modal models in embodied AI and robotics.

Abstract

Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.

Paper Structure

This paper contains 28 sections, 5 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Comparison of Mask Designs and Impact on Performance. (a) Unlike conventional causal masking that reinforces the input order, 3D-SLIM introduces Geo Mask and Inst Mask to capture spatial structures and encode task-oriented object representations for enhanced 3D reasoning. (b) When integrated with various LLM decoders, 3D-SLIM consistently improves performance across multiple benchmarks.
  • Figure 2: Overview of 3D Scene-Language Understanding and an Object-Centric Framework. (a) Examples of 3D reasoning tasks, including visual grounding and dense captioning. Object identifiers (e.g., <OBJ000>) are utilized to reference and ground target objects during the conversation. (b) The object-centric 3D LLM consists of an input construction module and an LLM decoder. Unlike prior works focusing on input construction, we emphasize the decoder to enhance 3D reasoning.
  • Figure 3: Overview of 3D Spatial Language Instruction Mask (3D-SLIM). (a) Illustration of the input 3D point cloud and user instruction, which are subsequently encoded as object and instruction tokens. For simplicity, only a small subset of object and instruction tokens is depicted. (b) The Geometry-adaptive Mask dynamically modulates object-object attention based on spatial proximity and local density; objects in denser regions attend to an expanded set of neighbors, whereas those in sparse areas focus on fewer neighbors. (c) The Instruction-aware Mask enables object-instruction attention, which enhances scene understanding by focusing on instruction-relevant content (e.g., "chairs", "table").
  • Figure 4: Visualization of LLM attention maps. In the instruction, key cues for answering the question are highlighted in yellow. The attention map is colored by activation intensity, where yellow represents high values and purple represents low ones. The green box in the second row denotes the ground-truth (GT) object, whereas the red boxes indicate the predicted objects.
  • Figure 5: Attention map over 3D objects and instruction. For simplicity, the attention map is computed over all object and instruction tokens, excluding the system tokens; yellow indicates high values and purple indicates low ones. The green boxes labeled ① and ② mark the regions used for Obj$\rightarrow$Obj Attn and Obj$\rightarrow$Inst Attn, respectively. Obj$\rightarrow$Obj Attn shows where the reference object token focuses within the 3D scene. For visual grounding, the reference is the model-predicted object; for question answering, it is a manually identified target object. Gray points indicate background. Obj$\rightarrow$Inst Attn visualizes how the reference token attends to instruction tokens, with red and yellow indicating high and low activations, respectively.
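
The Obj$\rightarrow$Obj and Obj$\rightarrow$Inst views in Figure 5 amount to slicing one row of an attention matrix into two segments. The snippet below is a rough sketch of that slicing, not the authors' visualization code. The helper name `reference_attention` and the token layout (system tokens, then object tokens, then instruction tokens) are assumptions for illustration.

```python
import numpy as np

def reference_attention(attn, ref_obj, n_sys, n_obj):
    """Split one object token's attention row into Obj->Obj and Obj->Inst parts.

    attn    : (L, L) attention weights, rows are queries and columns are keys
    ref_obj : index of the reference object among the object tokens (0-based)
    n_sys   : number of leading system tokens, which are skipped (assumed layout)
    n_obj   : number of object tokens that follow the system tokens
    """
    attn = np.asarray(attn, dtype=float)
    row = attn[n_sys + ref_obj]               # attention paid by the reference object
    obj_part = row[n_sys : n_sys + n_obj]     # Obj->Obj: weights over all object tokens
    inst_part = row[n_sys + n_obj :]          # Obj->Inst: weights over instruction tokens
    return obj_part, inst_part

# Toy example: 1 system token, 3 object tokens, 2 instruction tokens.
toy_attn = np.arange(36, dtype=float).reshape(6, 6)
obj_part, inst_part = reference_attention(toy_attn, ref_obj=1, n_sys=1, n_obj=3)
```

Under the assumed layout, a purely causal mask would leave the Obj$\rightarrow$Inst slice at zero, since instruction tokens come after the object tokens. That is the restricted object-instruction attention the abstract describes.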