LLMI3D: MLLM-based 3D Perception from a Single 2D Image

Fan Yang, Sicheng Zhao, Yanhao Zhang, Hui Chen, Haonan Lu, Jungong Han, Guiguang Ding

TL;DR

LLMI3D addresses the challenge of robust 3D perception from a single image by integrating an MLLM with three 3D-focused innovations: Spatial-Enhanced Local Feature Mining for local spatial fidelity, 3D Query Token-Derived Info Decoding for precise 3D attribute regression, and Geometry Projection-Based 3D Reasoning to mitigate camera focal variability. It couples an MLLM fine-tuned parameter-efficiently with LoRA with a CNN/ViT-based image encoder, and introduces IG3D, a fine-grained, caption- and VQA-enabled 3D grounding dataset derived from multiple major benchmarks. Across 3D grounding, open-vocabulary grounding, domain generalization, Mono3DRefer, and prompt-type experiments, LLMI3D achieves state-of-the-art performance, demonstrating strong generalization and reasoning capabilities in 3D tasks. The work highlights practical gains for embodied intelligence, notes that inference latency is higher than that of specialized models, and points toward efficiency-focused future work.

Abstract

Recent advancements in autonomous driving, augmented reality, robotics, and embodied intelligence have necessitated 3D perception algorithms. However, current 3D perception methods, especially specialized small models, exhibit poor generalization in open scenarios. On the other hand, multimodal large language models (MLLMs) excel in general capacity but underperform in 3D tasks, due to weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations. To address these challenges, we propose the following solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations. We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM. Additionally, we have constructed the IG3D dataset, which provides fine-grained descriptions and question-answer annotations. Extensive experiments demonstrate that our LLMI3D achieves state-of-the-art performance, outperforming other methods by a large margin.

Paper Structure

This paper contains 27 sections, 18 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Our LLMI3D endows MLLMs with 3D perception capabilities. When provided with a question or description, our LLMI3D can return the object of interest and its 3D bounding box (bbox) in 3D space. Across various datasets, our LLMI3D significantly outperforms existing methods.
  • Figure 2: Three issues of vanilla MLLMs in 3D perception tasks: (a) Weak 3D local spatial object perception: MLLMs struggle with accurate 3D object localization due to poor spatial understanding, especially for distant or small objects. (b) Poor text-based geometric numerical output: Current models output 3D values in text, which is slow and error-prone. Our approach utilizes a learnable 3D Query token with 3D heads to regress geometric values, improving accuracy significantly. (c) Inability to handle camera focal variations: Distinguishing changes in camera focal length from a single 2D image is hard. This leads to incorrect depth predictions for similarly sized objects captured at different focal lengths.
  • Figure 3: Framework of LLMI3D. (1) The image encoder utilizes Spatial-Enhanced Local Feature Mining, employing a CNN and depth predictor to extract spatially enhanced local features from high-resolution (HR) images. A ViT extracts global features with fewer tokens from low-resolution (LR) images, while spatial-enhanced cross-branch attention efficiently retrieves object spatial features and reduces the token count. (2) In the LLM, we propose 3D Query Token-Derived Info Decoding. We utilize a learnable 3D Query token to extract 3D features and employ 3D heads to regress the geometric attributes precisely. (3) To derive the 3D box of the object, we introduce Geometry Projection-Based 3D Reasoning. Rather than using black-box methods that are blind to focal length, we combine the network with geometric projection for 3D spatial reasoning, alleviating the errors introduced by varying camera focal lengths in 3D perception.
  • Figure 4: The auto-labeling process for our IG3D dataset. For each image, we use visual prompting VisualGPT to add a 2D box around the target object. A pre-trained MLLM then generates descriptive captions. In the auto-labeling prompt, {category_name} is replaced with the object's category name.
  • Figure 5: Examples of 3D VQA by our LLMI3D on the IG3D-SUNRGBD-VQA dataset. Our LLMI3D is capable of understanding personalized user questions, leveraging common knowledge and logical reasoning to identify objects of interest, and returning the corresponding 3D bboxes.
  • ...and 2 more figures
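
The focal-length issue in Figure 2(c) and the geometric projection step in Figure 3 can be illustrated with the standard pinhole camera model. The sketch below is not the paper's implementation, since this summary does not give its exact formulation; the function names and the numbers are hypothetical and chosen only for illustration. It shows that the same object can produce the same pixel height at two different depths when the focal length differs, and that explicitly using the focal length resolves the ambiguity.

```python
import numpy as np

def apparent_height_px(real_height_m, depth_m, fy):
    """Pinhole model: pixel height of an object of known size at a given depth."""
    return fy * real_height_m / depth_m

def depth_from_height(real_height_m, height_px, fy):
    """Invert the pinhole relation: recover depth when the focal length is known."""
    return fy * real_height_m / height_px

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Hypothetical example: a 1.5 m tall object seen by two cameras.
h_near = apparent_height_px(1.5, depth_m=10.0, fy=500.0)   # short focal length, close object
h_far = apparent_height_px(1.5, depth_m=20.0, fy=1000.0)   # long focal length, distant object

# Both views yield the same pixel height, so appearance alone cannot separate them.
same_appearance = np.isclose(h_near, h_far)

# With the focal length made explicit, the two depths are recovered correctly.
z_near = depth_from_height(1.5, h_near, fy=500.0)
z_far = depth_from_height(1.5, h_far, fy=1000.0)

# A predicted 2D center plus depth gives the 3D center via the intrinsics.
center_3d = backproject(u=320.0, v=240.0, depth_m=z_far,
                        fx=1000.0, fy=1000.0, cx=320.0, cy=240.0)
```

A network that regresses depth directly from pixels has to infer the focal length implicitly, whereas a design that passes image-plane quantities through the camera intrinsics, as the caption of Figure 3 describes at a high level, keeps that dependence explicit.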