ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
TL;DR
ShapeLLM introduces a universal 3D multimodal LLM for embodied interaction by pairing a scaled 3D point-cloud encoder (ReCon++) with a LLaMA-based language model. It addresses data scarcity with GPT-4V-generated instruction-following data, and it introduces a new benchmark, 3D MM-Vet, that assesses recognition, knowledge generation, spatial reasoning, and embodied planning. Key contributions include ReCon++ with multi-view distillation, an instruction-tuned 3D vision-language framework, and the challenging 3D MM-Vet benchmark. The results demonstrate state-of-the-art performance in 3D representation transfer, strong 3D geometry understanding, and robust multimodal comprehension, highlighting potential for real-world robotic reasoning and interaction.
Abstract
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder, ReCon++, which extends ReCon and benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for the LLM, ShapeLLM is trained on constructed instruction-following data and tested on our new human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/
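The abstract states that ReCon++ serves as the 3D point cloud input encoder for the LLM but does not say how encoder outputs reach the language model. The toy sketch below illustrates one common pattern in multimodal LLMs, assumed here rather than taken from the paper: encoder tokens are linearly projected to the LLM embedding width and placed before the text embeddings. The function names (`project_tokens`, `build_llm_input`), the dimensions, and the random values are all hypothetical stand-ins, not ShapeLLM's actual implementation.

```python
import random

def project_tokens(point_tokens, weight):
    """Map each point-cloud token (length d_point) to length d_llm.

    `weight` holds d_llm rows of d_point values (a plain linear layer,
    no bias). This projector is an assumption, not the paper's design.
    """
    return [
        [sum(x * w for x, w in zip(token, row)) for row in weight]
        for token in point_tokens
    ]

def build_llm_input(point_tokens, text_embeddings, weight):
    """Place the projected 3D tokens before the text embeddings."""
    return project_tokens(point_tokens, weight) + text_embeddings

random.seed(0)
d_point, d_llm = 4, 6  # hypothetical feature widths

# Stand-ins for encoder outputs (3 tokens) and text embeddings (5 tokens).
point_tokens = [[random.random() for _ in range(d_point)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(d_llm)] for _ in range(5)]
weight = [[random.random() for _ in range(d_point)] for _ in range(d_llm)]

sequence = build_llm_input(point_tokens, text_embeddings, weight)
print(len(sequence), len(sequence[0]))
```

In this sketch the language model would consume `sequence` as a single embedding sequence, so the 3D tokens and the instruction text share one context.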
