3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o
Dingning Liu, Cheng Wang, Peng Gao, Renrui Zhang, Xinzhu Ma, Yuan Meng, Zhihui Wang
TL;DR
This work tackles the challenge of enabling 3D grounding and reasoning in multimodal LLMs without model fine-tuning. It introduces 3DAxisPrompt, which injects explicit 3D priors via a rendered 3D axis and SAM-derived masks into multi-view observations, enabling GPT-4o and other MLLMs to infer 3D positions, plan routes, and predict robot actions. Across indoor/outdoor localization, 3D grounding, route planning, and robotic tasks on datasets like ScanNet, ScanRefer, FMB, and nuScenes, the method yields meaningful gains and reveals that no single prompt format excels for all tasks; multi-view and well-constructed 2D/3D marks with axis cues provide the best results. The study underscores the feasibility of leveraging prompt engineering to extend 2D grounding into real-world 3D understanding, while also outlining limitations such as occlusion and the need for task-specific prompt configurations. Overall, 3DAxisPrompt offers a practical, zero-shot baseline toward integrating MLLMs with 3D vision and paves the way for future refinements.
Abstract
Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the ability of MLLMs to operate effectively in 3D vision remains largely unexplored. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages a 3D coordinate axis and masks generated by the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs, and thereby extends their impressive 2D grounding and reasoning ability to real-world 3D scenarios. We also provide a first thorough investigation of potential visual prompting formats and summarize our findings to reveal the potential and limits of the 3D understanding capabilities of GPT-4o, as a representative MLLM. Finally, we build evaluation environments on four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScenes, covering various 3D tasks. Based on these, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, no single prompt engineering approach consistently achieves the best outcome across all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D visual grounding and reasoning through prompt engineering techniques.
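To make the idea of a rendered 3D axis more concrete, the following is a minimal sketch of how labeled tick marks along a 3D coordinate axis could be projected into a single camera view before being drawn onto the image. It is not the authors' implementation: the pinhole camera model, the function names `axis_ticks` and `project_point`, and all parameter values are assumptions chosen for illustration, and the SAM mask overlay is omitted.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def axis_ticks(origin: Vec3, length: float, step: float) -> List[Tuple[str, float, Vec3]]:
    """Return (axis name, tick value, 3D point) for ticks along x, y and z.

    Hypothetical helper: tick spacing and axis length are free parameters.
    """
    count = int(round(length / step))
    ticks = []
    for axis_index, axis_name in enumerate("xyz"):
        for k in range(1, count + 1):
            point = list(origin)
            point[axis_index] += k * step
            ticks.append((axis_name, k * step, (point[0], point[1], point[2])))
    return ticks


def project_point(
    p_world: Vec3,
    rotation: Sequence[Sequence[float]],
    translation: Vec3,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Tuple[float, float]]:
    """Project a world point to pixel coordinates with a pinhole model.

    Assumes p_cam = R @ p_world + t and no lens distortion.
    Returns None when the point lies behind the camera.
    """
    xc, yc, zc = (
        sum(rotation[i][j] * p_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Illustrative camera: axis origin placed 2 units in front of the camera.
    for name, value, point in axis_ticks((0.0, 0.0, 0.0), 1.0, 0.5):
        pixel = project_point(point, identity, (0.0, 0.0, 2.0), 100.0, 100.0, 50.0, 50.0)
        print(name, value, pixel)
```

In a multi-view setting, the same tick points would be projected once per view with that view's rotation and translation, so that the labeled axis is consistent across images. The projected pixel positions and tick labels could then be drawn with any 2D drawing library.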
