3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
TL;DR
The paper addresses the scarcity of large-scale, densely-grounded 3D-language data for embodied AI. It introduces 3D-GRAND, a 40K-scene, 6.2M-annotation dataset that densely grounds language to 3D scenes, and 3D-POPE, a benchmark suite to quantify object hallucination in 3D-LLMs. Through experiments with a LoRA-finetuned Llama-2 model, it demonstrates that dense grounding and larger synthetic datasets significantly improve grounding accuracy and reduce hallucinations, with promising sim-to-real transfer to real-world scans. The work provides a scalable path for building reliable 3D-LLMs and establishes resources for robust evaluation and comparison across models.
Abstract
The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark, 3D-POPE, to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results also show early signals of effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights that lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
