3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset

Junjie Zhang, Tianci Hu, Xiaoshui Huang, Yongshun Gong, Dan Zeng

TL;DR

3DBench addresses the lack of scalable evaluation for 3D-LLMs by introducing a ten-task multimodal benchmark spanning object- to scene-level perception and navigation, evaluated with three metrics. It couples the benchmark with a large-scale, automatically constructed instruction-tuning dataset (over 0.23M QA pairs) generated through a two-step pipeline using AI2-THOR and GPT. Experiments across zero-shot and retraining scenarios show that 3DBench makes evaluation more robust and that dataset scale and model adaptation significantly affect performance, while highlighting persistent gaps in spatial reasoning and dialogue quality. The work provides a practical platform to assess and guide the development of 3D-LLMs and high-quality 3D instruction-tuning resources.

Abstract

Evaluating the performance of Multi-modal Large Language Models (MLLMs) that integrate both point clouds and language presents significant challenges. The lack of a comprehensive assessment makes it difficult to determine whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations rely heavily on classification and captioning tasks, which fall short of providing a thorough assessment of MLLMs. A pressing need exists for a more sophisticated evaluation method capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset, known as 3DBench, providing an extensible platform for a comprehensive evaluation of MLLMs. Specifically, we establish a benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations of training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions.

Paper Structure (17 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Zero-shot evaluation of three state-of-the-art 3D-LLMs on proposed 3DBench with ten multi-modal tasks.
  • Figure 2: The current task overview of 3DBench. 3DBench comprehensively addresses the complexity of both spatial and logical aspects, categorizing ten individual tasks into three levels. Future tasks can seamlessly integrate into this framework.
  • Figure 3: Overview of the 3DBench benchmark, encompassing ten 3D computer vision tasks and metrics from three perspectives, including traditional accuracy, the IoU metric, GPT scores, and the novel path loss metric that we introduce.
  • Figure 4: Pipeline for generating the dataset. We are able to automatically collect instruction-tuning data for all detailed tasks in 3DBench. Each data sample comprises a point cloud (scene or object) along with the corresponding task dialogue.
  • Figure 5: Illustration of path loss. It is the accumulated distance between each endpoint on the longer of the two trajectories (GT and prediction) and its nearest neighbor on the other one.
  • ...and 4 more figures
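
The path loss described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `path_loss` is hypothetical, "longer" is taken to mean the trajectory with more endpoints, distances are Euclidean, and no normalization is applied, since the caption does not specify any of these details.

```python
import math


def path_loss(gt, pred):
    """Accumulate nearest-neighbor distances from the longer trajectory.

    Assumptions (not confirmed by the source): trajectories are non-empty
    sequences of coordinate tuples, "longer" means more endpoints, and
    distances are Euclidean.
    """
    if not gt or not pred:
        raise ValueError("both trajectories must be non-empty")
    # Pick the trajectory with more endpoints as the one to iterate over.
    longer, other = (gt, pred) if len(gt) >= len(pred) else (pred, gt)
    # For each endpoint on the longer trajectory, add the distance to its
    # nearest neighbor on the other trajectory.
    return sum(min(math.dist(p, q) for q in other) for p in longer)


gt_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred_path = [(0.0, 0.0), (2.0, 0.0)]
print(path_loss(gt_path, pred_path))
```

Iterating over the longer trajectory means a prediction that skips waypoints is still penalized for each ground-truth endpoint it fails to approach, which is one plausible reading of why the caption singles out the longer path.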