ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi; Runpei Dong; Shaochen Zhang; Haoran Geng; Chunrui Han; Zheng Ge; Li Yi; Kaisheng Ma

TL;DR

ShapeLLM introduces a universal 3D multimodal LLM for embodied interaction by pairing a scaled 3D point-cloud encoder (ReCon++) with a LLaMA-based language model. It tackles data scarcity through GPT-4V-generated instruction-following data and a new 3D MM-Vet benchmark that assesses recognition, knowledge generation, spatial reasoning, and embodied planning. Key contributions include ReCon++ with multi-view distillation, an instruction-tuned 3D vision-language framework, and a challenging benchmark showing strong 3D geometry understanding and embodied capabilities. The results demonstrate state-of-the-art performance in 3D representation transfer and robust multimodal comprehension, highlighting potential for real-world robotic reasoning and interaction.

Abstract

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and language. ShapeLLM is built upon an improved 3D encoder, ReCon++, which extends ReCon and benefits from multi-view image distillation for enhanced geometry understanding. Using ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding. Project page: https://qizekun.github.io/shapellm/

Paper Structure (34 sections, 7 equations, 17 figures, 14 tables)

Introduction
ShapeLLM
Overall Architecture
How to alleviate interactive 3D understanding Data Desert?
ReCon++: Scaling Up 3D Representation Learning
3D MM-Vet: Benchmarking 3D Comprehension
Experiments
3D Representation Transferring with ReCon++
Multimodal Comprehension with ShapeLLM
Discussions
Is ShapeLLM grounded in physical worlds?
Can ShapeLLM generalize to unseen objects?
Related Works
Conclusions
Additional Experiments
...and 19 more sections

Figures (17)

Figure 1: Demonstrations of ShapeLLM and ReCon++. We present ShapeLLM, the first 3D LLM designed for embodied interaction and spatial intelligence.
Figure 3: Qualitative visualization of the instruction-following and 3D MM-Vet data.
Figure 4: Qualitative examples of the embodied interaction data.
Figure 5: Selected multimodal dialogue examples. ShapeLLM possesses robust capabilities in knowledge representation, reasoning, and instruction-following dialogue. With its powerful point cloud encoder ReCon++, ShapeLLM can even make accurate predictions about minute interactive components, e.g., a handle. The rendered mesh images are solely for visual reference here and do not constitute input data.
Figure 6: Robustness of zero-shot 3D multimodal comprehension on 3D MM-Vet-C. Clean: no corruptions. Single-View: randomly select a camera viewpoint within the unit sphere and generate a single viewpoint within the FoV on polar coordinates. Jitter: Gaussian jittering with noise $\epsilon\sim\mathcal{N}(0,\sigma^2)$ and $\sigma=0.01$. Rotate: random SO(3) rotation sampling over X-Y-Z Euler angles $(\alpha,\beta,\gamma)\sim \mathcal{U}(-\theta,\theta)$ and $\theta=\pi/6$.
...and 12 more figures
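
The Jitter and Rotate corruptions described in the Figure 6 caption can be illustrated with a minimal sketch. This is not the authors' code: the function names (`jitter`, `rotate`, `euler_xyz_matrix`) are hypothetical, and the composition order of the three Euler rotations (here R = Rz·Ry·Rx) is an assumption, since the caption only states that the angles are sampled uniformly from (-θ, θ) with θ = π/6. The Single-View corruption is not covered.

```python
import math
import random


def jitter(points, sigma=0.01, rng=None):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every coordinate."""
    rng = rng or random.Random()
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


def euler_xyz_matrix(alpha, beta, gamma):
    """Rotation matrix for rotations about X, Y, Z by alpha, beta, gamma.

    Assumed convention (not stated in the caption): R = Rz(gamma) Ry(beta) Rx(alpha).
    """
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def rotate(points, theta=math.pi / 6, rng=None):
    """Rotate all points by one random rotation with Euler angles ~ U(-theta, theta)."""
    rng = rng or random.Random()
    alpha, beta, gamma = (rng.uniform(-theta, theta) for _ in range(3))
    rot = euler_xyz_matrix(alpha, beta, gamma)
    return [
        tuple(sum(rot[i][j] * p[j] for j in range(3)) for i in range(3))
        for p in points
    ]


if __name__ == "__main__":
    # A toy three-point "cloud"; real inputs would contain thousands of points.
    cloud = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.3, -0.4, 0.5)]
    print(jitter(cloud, rng=random.Random(0)))
    print(rotate(cloud, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps a corrupted test set reproducible across runs. A single rotation is applied to the whole cloud, so distances between points are preserved, whereas jitter perturbs each coordinate independently.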
