M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts

Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen

TL;DR

M3DBench introduces a large-scale, multi-modal 3D instruction-following dataset that unifies region- and scene-level tasks and interleaves text with coordinate, image, and 3D object prompts. The paper presents a simple baseline that couples a 3D scene perceiver, a multi-modal instruction encoder, and a frozen LLM, trained by updating only the projection layers. Experiments across dense captioning (DC), visual question answering (VQA), embodied question answering (EQA), multi-region reasoning (MR), and embodied planning (EP) demonstrate the dataset’s utility and reveal strengths and limitations of current 3D MLM baselines, guiding future work on 3D scene understanding, reasoning, and embodied planning. By offering a dataset, evaluation protocols, and baselines, M3DBench aims to accelerate development of versatile 3D multimodal models for real-world applications.

Abstract

Recently, 3D understanding has become popular for helping autonomous agents perform further decision-making. However, existing 3D datasets and methods are often limited to specific tasks. On the other hand, recent progress in Large Language Models (LLMs) and Multimodal Language Models (MLMs) has demonstrated exceptional performance on general language and image tasks. It is therefore interesting to unlock MLMs' potential to serve as 3D generalists across a wider range of tasks. However, research on MLMs has focused less on 3D tasks due to a lack of large-scale 3D instruction-following datasets. In this work, we introduce a comprehensive 3D instruction-following dataset called M3DBench, which has the following characteristics: 1) It supports general multi-modal instructions interleaved with text, images, 3D objects, and other visual prompts. 2) It unifies diverse 3D tasks at both region and scene levels, covering a variety of fundamental abilities in real-world 3D environments. 3) It is a large-scale 3D instruction-following dataset with over 320k instruction-response pairs. Furthermore, we establish a new benchmark for assessing the performance of large models in understanding multi-modal 3D prompts. Extensive experiments demonstrate the effectiveness of our dataset and baseline in supporting general 3D-centric tasks, which can inspire future research.

Paper Structure

This paper contains 28 sections, 4 equations, 15 figures, and 17 tables.

Figures (15)

  • Figure 1: Statistics of M3DBench. (a) The distribution of instructions by first word: the inner circle shows how often each first word occurs, and the outer circle shows the frequency of the verbs and nouns that appear in instructions beginning with that word. (b) The word cloud of responses. (c) The distribution of instruction length. (d) The distribution of response length.
  • Figure 2: Overview of our baseline model. We use a scene perceiver to extract scene tokens from the 3D visual input. Multi-modal instructions are transformed into instruction tokens by their respective encoders. The scene tokens and multi-modal instruction tokens are then concatenated and fed into a frozen LLM, which generates the corresponding responses. During training, only the projectors are updated.
  • Figure 3: Qualitative Results. We provide visualization results on various 3D-centric tasks in diverse 3D environments. Orange highlights the wrong answer.
  • Figure 4: Examples of 3D object detection. The left column represents the 3D scene, the middle column displays the instructions, and the right column shows the annotations for the object detection task. Annotations are saved in textual format; for visualization purposes here, we extract the bounding boxes from the text.
  • Figure 5: Examples of 3D visual grounding. The left column represents the 3D scene, the middle column displays the instructions, and the right column shows the annotations for the visual grounding. M3DBench includes interleaved multi-modal instructions, and the annotations extend beyond annotating a single target object, encompassing the identification of multiple objects.
  • ...and 10 more figures
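
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the `Linear` class, the feature widths in `FEATURE_DIMS`, the modality names, `LLM_DIM`, and the `frozen_text_embed` stand-in are all hypothetical. It shows the two ideas from the caption: tokens from every modality are projected to a common width and concatenated into one sequence, and only the projectors are marked trainable.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
LLM_DIM = 8  # hypothetical LLM embedding width


class Linear:
    """Tiny dense layer; `trainable` marks whether training would update it."""

    def __init__(self, in_dim: int, out_dim: int, trainable: bool, seed: int = 0):
        rng = random.Random(seed)
        self.trainable = trainable
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]

    def __call__(self, tokens: List[Vector]) -> List[Vector]:
        # Map every input token to an out_dim-wide vector.
        return [[sum(w * x for w, x in zip(row, tok)) for row in self.weight]
                for tok in tokens]


# Hypothetical feature widths, standing in for the outputs of frozen encoders.
FEATURE_DIMS = {"scene": 6, "image": 5, "object3d": 4, "coord": 3}

# One projector per modality: the only trainable modules in this sketch.
projectors: Dict[str, Linear] = {
    name: Linear(dim, LLM_DIM, trainable=True, seed=i)
    for i, (name, dim) in enumerate(FEATURE_DIMS.items())
}

# Stand-in for the frozen LLM's own text embedding (an assumption of this sketch).
frozen_text_embed = Linear(LLM_DIM, LLM_DIM, trainable=False, seed=99)


def build_llm_input(scene_feats: List[Vector],
                    instruction: List[Tuple[str, List[Vector]]]) -> List[Vector]:
    """Concatenate projected scene tokens with interleaved instruction tokens."""
    tokens = projectors["scene"](scene_feats)
    for modality, feats in instruction:
        if modality == "text":
            tokens += frozen_text_embed(feats)
        else:
            tokens += projectors[modality](feats)
    return tokens


def trainable_modules() -> List[str]:
    """Names of the modules a training step would update."""
    modules = dict(projectors, text_embed=frozen_text_embed)
    return sorted(name for name, m in modules.items() if m.trainable)


# Usage: 4 scene tokens, then text / coordinate / text instruction segments.
scene_feats = [[1.0] * 6 for _ in range(4)]
instruction = [
    ("text", [[0.5] * LLM_DIM for _ in range(3)]),
    ("coord", [[0.1, 0.2, 0.3]]),
    ("text", [[0.5] * LLM_DIM for _ in range(2)]),
]
llm_input = build_llm_input(scene_feats, instruction)
print(len(llm_input), len(llm_input[0]))  # 10 8
print(trainable_modules())
```

In a real system the frozen LLM would consume `llm_input` and decode the response. The sketch stops at the concatenated sequence because that is the part the caption specifies.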