3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

TL;DR

The paper addresses the scarcity of large-scale, densely-grounded 3D-language data for embodied AI. It introduces 3D-GRAND, a 40K-scene, 6.2M-annotation dataset that densely grounds language to 3D scenes, and 3D-POPE, a benchmark suite to quantify object hallucination in 3D-LLMs. Through experiments with a LoRA-finetuned Llama-2 model, it demonstrates that dense grounding and larger synthetic datasets significantly improve grounding accuracy and reduce hallucinations, with promising sim-to-real transfer to real-world scans. The work provides a scalable path for building reliable 3D-LLMs and establishes resources for robust evaluation and comparison across models.

Abstract

The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights to lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io

Paper Structure

This paper contains 26 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: 3D-GRAND dataset and statistics. (Left): 3D-GRAND is a large-scale, densely-grounded 3D-text dataset with 8 different tasks. (Right): From 40K 3D scenes, 3D-GRAND annotates 6.2M 3D-text pairs.
  • Figure 2: 3D-GRAND Data Curation Pipeline.
  • Figure 3: Data scaling analysis of zero-shot, sim-to-real grounding capability and hallucination. Grounding performance (left two subfigures) consistently improves as data scales up. The model trained with densely-grounded data exhibits better grounding capability than the one trained without it. Additionally (right subfigure), the model hallucinates less when exposed to more data from 3D-GRAND. Here, the Hallucination Rate is calculated as $(1 - \text{Precision})$ on 3D-POPE.
  • Figure 4: 3D-GRAND model input and output on Grounded Object Reference task.
  • Figure 5: Demo of interactive chat interface with the 3D-GRAND model.
  • ...and 5 more figures
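
The caption of Figure 3 defines the Hallucination Rate as $(1 - \text{Precision})$ on 3D-POPE. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. It assumes that each benchmark item is a yes/no question about whether an object exists in the scene, so that precision is the fraction of "yes" answers for which the object is actually present. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def precision(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of 'yes' predictions for which the object is truly present.

    Assumption: True means "yes, the object exists in the scene".
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    # True positives: the model said "yes" and the object is present.
    true_positives = sum(1 for p, y in zip(predictions, labels) if p and y)
    # All "yes" answers, whether correct or not.
    predicted_positives = sum(1 for p in predictions if p)
    if predicted_positives == 0:
        raise ValueError("precision is undefined when the model never answers 'yes'")
    return true_positives / predicted_positives


def hallucination_rate(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Hallucination Rate as defined in the Figure 3 caption: 1 - Precision."""
    return 1.0 - precision(predictions, labels)


# Toy example (made-up answers, not results from the paper):
# the model answers "yes" four times, and two of those objects are absent.
toy_predictions = [True, True, True, False, True]
toy_labels = [True, False, True, False, False]
toy_rate = hallucination_rate(toy_predictions, toy_labels)
print(f"hallucination rate: {toy_rate:.2f}")
```

Under this reading, a lower value means fewer of the model's "yes" answers refer to objects that are not in the scene. The sketch raises an error when the model never answers "yes", since precision is undefined in that case.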