3D Spatial Understanding in MLLMs: Disambiguation and Evaluation

Chun-Peng Chang, Alain Pagani, Didier Stricker

TL;DR

This work tackles the problem of contextually localizing and disambiguating a target object in 3D scenes for multimodal LLMs used in robotics. It proposes a four-stage pipeline that explicitly identifies distractors, employs relative position encoding, and uses random/ambiguous anchors with a two-stage loss to train the model, validated against 3D visual grounding and standard Sr3D/Nr3D benchmarks. The approach achieves state-of-the-art performance on Nr3D/Sr3D and demonstrates that synthetic data can effectively train 3D visual grounding models, improving spatial understanding beyond traditional sentence similarity. These findings advance practical, spatially precise human-robot communication in complex 3D environments while outlining limitations and future research directions.

Abstract

Multimodal Large Language Models (MLLMs) have made significant progress in tasks such as image captioning and question answering. However, while these models can generate realistic captions, they often struggle to provide precise instructions, particularly when localizing and disambiguating objects in complex 3D environments. This capability is critical as MLLMs become more integrated with collaborative robotic systems. In scenarios where a target object is surrounded by similar objects (distractors), robots must deliver clear, spatially aware instructions to guide humans effectively. We refer to this challenge as contextual object localization and disambiguation, which imposes stricter constraints than conventional 3D dense captioning, especially in ensuring target exclusivity. In response, we propose simple yet effective techniques to enhance the model's ability to localize and disambiguate target objects. Our approach not only achieves state-of-the-art performance on conventional metrics that evaluate sentence similarity, but also demonstrates improved 3D spatial understanding when evaluated with a 3D visual grounding model.

Paper Structure

This paper contains 22 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Common mistakes made by MLLMs in precise, target-exclusive localization and disambiguation. (a) depicts a hallucination error in which the model refers to an absent object. In (b), the chosen anchor is ambiguous, making localization difficult. (c) illustrates an unsuitable anchor object that does not help localize the target. (d) shows a scenario in which the model selects an appropriate anchor object but fails to provide a correct spatial description.
  • Figure 2: Our training and evaluation pipeline consists of four key steps: visual input encoding, VLM token encoding, LLM generation, and evaluation. The input to the system is a point cloud and a target ID. The target ID typically comes from upstream tasks such as robot assistants or human-robot teaching interactions, and may be a bounding box or just an ID number. In the visual input encoding step, we identify distractors based on the point cloud features of the target object and encode the relative spatial relationships between the target, distractors, and potential anchors. Various token encoding techniques are then applied. During evaluation, we assess the quality of the generated spatial instructions by measuring both sentence similarity and 3D spatial understanding.
  • Figure 3: Comparison of captions generated by different language models. Both Vote2Cap++ and our model generate sentences with structures highly similar to the reference sentence. However, the key difference lies in the spatial understanding: the sentences generated by our proposed method demonstrate a better grasp of the spatial configuration within the scene, even when a different anchor is chosen. This shows that our approach helps the model develop a deeper understanding of the 3D scene, rather than merely mimicking the surface-level sentence structure.
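
The visual input encoding step described in the Figure 2 caption (finding distractors from object features, then encoding positions relative to the target) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneObject` class, the cosine-similarity threshold, and the offset-plus-distance encoding are all assumptions chosen for illustration, and the feature vectors here are toy values rather than real point cloud features.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """Hypothetical container for one object in the scene."""
    obj_id: int
    center: Tuple[float, float, float]   # object centroid (x, y, z)
    feature: Tuple[float, ...]           # stand-in for a point cloud feature


def cosine_similarity(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_distractors(target: SceneObject, objects: List[SceneObject],
                     threshold: float = 0.9) -> List[SceneObject]:
    """Assumed rule: an object is a distractor if its feature is close
    to the target's feature (similarity >= threshold)."""
    return [
        o for o in objects
        if o.obj_id != target.obj_id
        and cosine_similarity(target.feature, o.feature) >= threshold
    ]


def relative_position(target: SceneObject,
                      other: SceneObject) -> Tuple[float, float, float, float]:
    """Assumed encoding: offset from target to other, plus Euclidean distance."""
    dx, dy, dz = (o - t for o, t in zip(other.center, target.center))
    return (dx, dy, dz, math.sqrt(dx * dx + dy * dy + dz * dz))


# Toy scene: object 0 is the target, object 1 looks similar, object 2 does not.
scene = [
    SceneObject(0, (0.0, 0.0, 0.0), (1.0, 0.0)),
    SceneObject(1, (3.0, 4.0, 0.0), (0.9, 0.1)),
    SceneObject(2, (1.0, 0.0, 0.0), (0.0, 1.0)),
]
target = scene[0]
distractors = find_distractors(target, scene)
encodings: Dict[int, Tuple[float, float, float, float]] = {
    o.obj_id: relative_position(target, o)
    for o in scene if o.obj_id != target.obj_id
}
```

In this toy scene, object 1 is flagged as a distractor while object 2 remains a candidate anchor. A real system would replace the threshold rule and the hand-built encoding with learned components.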