When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

Xianzheng Ma, Brandon Smart, Yash Bhalgat, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu

TL;DR

This paper surveys the rapidly growing field of 3D large language models by organizing representations, architectures, tasks, datasets, and evaluation into a coherent taxonomy. It highlights five roles for LLMs in 3D tasks: enhancing task performance, enabling multi-task learning, serving as multi-modal interfaces, powering embodied agents, and facilitating 3D generation. The authors identify progress since 2023 while underscoring core bottlenecks such as 3D data scarcity, representation trade-offs, and the lack of robust 3D-grounded evaluation metrics. They propose directions toward 3D-centric pretraining, bidirectional alignment between 3D and language, and safer, more interpretable embodied systems. Overall, the work provides a roadmap for advancing spatial intelligence through integrated 3D data and language models, with practical guidance for researchers and practitioners.

Abstract

As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.

Paper Structure (45 sections, 5 figures, 4 tables)

This paper contains 45 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: A timeline of developments in 3D representations (Sec. \ref{sec:3d_representations}), L(V)LMs (Sec. \ref{sec:bkground_large_language_model} and Sec. \ref{sec:2D_VLM_VFM}), 3D w/VLMs&VFMs (Sec. \ref{sec:3d-vlms}), and 3D w/LLMs (Sec. \ref{sec:3d-llms}). The figure illustrates how progress in 3D representations and L(V)LMs inspired the 3D vision-language methods (3D w/LLMs, 3D w/VLMs&VFMs). It also shows the rapid growth of 3D-LLM methods from 2023 onward, which calls for the attention of more researchers.
  • Figure 2: Common task categories for 3D vision-language models. Each category contains many types of queries, but we provide a single example from each category. The scan is from the ScanNet dataset [dai2017scannet], and the 'question answering' and 'situated question answering' examples are adapted from ScanQA [scanqa] and SQA3D [ma2022sqa3d], respectively.
  • Figure 3: Architectures for aligning 3D with text for LLMs. Here we show four high-level architectures: (a) a 3D-only model that aligns 3D features to the LLM's input space, (b) a 3D+text model where 3D features and text are both aligned, (c) a Q-Former style model where text is used to condition the alignment of the 3D features and is optionally given to the LLM itself (dashed arrow), and (d) a text-only approach that converts 3D representations into text strings, avoiding the need to train an alignment module.
  • Figure 4: Taxonomy of 3D with LLM methods. In Sec. \ref{sec:3d-llms}, we analyze the role LLMs have played in solving 3D tasks from five perspectives: Enhancing 3D Tasks, Multi-Task Learning, 3D Multi-modal Interfaces, Embodied Agents, and 3D Generation.
  • Figure 5: Timeline of datasets. A timeline showing how existing datasets are combined and annotated to form new datasets for 3D vision language tasks. Datasets in orange are foundational 3D datasets without language annotations and datasets in blue are the annotated datasets used in 3D vision language tasks. Note that WildRefer also introduces new 3D data and annotations for vision-language tasks.
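
To make two of the alignment strategies named in the Figure 3 caption more concrete, the following is a minimal sketch, not the method of any surveyed paper. It illustrates (a) a linear projection of 3D features into an LLM's input dimension and (d) a text-only serialization of a scene into a prompt string. Every name here (`SceneObject`, `project_features`, `scene_to_text`, `build_prompt`), the `<obj_i>` token format, and the prompt template are assumptions made for illustration; real systems typically learn the projection weights and use their own prompt formats.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SceneObject:
    """Hypothetical detected object: a label plus an axis-aligned box."""
    label: str
    center: Sequence[float]  # (x, y, z), assumed to be in metres
    size: Sequence[float]    # box extent along x, y, z, assumed in metres


def project_features(
    features: List[List[float]],
    weight: List[List[float]],
    bias: List[float],
) -> List[List[float]]:
    """Sketch of architecture (a): a linear map from the 3D feature
    dimension to the LLM embedding dimension.

    features: N vectors of length D_in
    weight:   D_out rows, each of length D_in (learned in practice)
    bias:     vector of length D_out
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def scene_to_text(objects: List[SceneObject], precision: int = 2) -> str:
    """Sketch of architecture (d): serialize a 3D scene as plain text,
    so no alignment module needs to be trained."""
    lines = []
    for idx, obj in enumerate(objects):
        center = ", ".join(f"{v:.{precision}f}" for v in obj.center)
        size = ", ".join(f"{v:.{precision}f}" for v in obj.size)
        # The <obj_i> identifier format is an assumption for illustration.
        lines.append(f"<obj_{idx}> {obj.label}: center=({center}), size=({size})")
    return "\n".join(lines)


def build_prompt(objects: List[SceneObject], question: str) -> str:
    """Combine the serialized scene with a question (assumed template)."""
    return f"Scene objects:\n{scene_to_text(objects)}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    scene = [
        SceneObject("chair", (1.0, 2.0, 0.45), (0.5, 0.5, 0.9)),
        SceneObject("table", (1.5, 2.0, 0.4), (1.2, 0.8, 0.8)),
    ]
    print(build_prompt(scene, "Which object is next to the chair?"))
```

The text-only route trades fidelity for simplicity: geometry is reduced to labels and boxes, but any off-the-shelf LLM can consume the result. The projection route keeps richer features at the cost of training the mapping.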