How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

Jirong Zha, Yuxuan Fan, Xiao Yang, Chen Gao, Xinlei Chen

TL;DR

This survey reviews the integration of large language models with three-dimensional spatial understanding, categorizing methods into image-based, point cloud-based, and hybrid modality-based approaches. It covers data representations, architectural adaptations, and training strategies that bridge textual and 3D information, and discusses current challenges such as data scarcity, representation gaps, and computational costs. The paper highlights key advances across input modalities and alignment strategies, and outlines future directions in 3D perception, multi-modal fusion, cross-scene generalization, and open-vocabulary evaluation. It aims to guide researchers and accelerate practical deployment of 3D-aware LLMs in robotics, healthcare, design, and related fields.

Abstract

3D spatial understanding is essential in real-world applications such as robotics, autonomous vehicles, virtual reality, and medical imaging. Recently, Large Language Models (LLMs), having demonstrated remarkable success across various domains, have been leveraged to enhance 3D understanding tasks, showing potential to surpass traditional computer vision methods. In this survey, we present a comprehensive review of methods integrating LLMs with 3D spatial understanding. We propose a taxonomy that categorizes existing methods into three branches: image-based methods deriving 3D understanding from 2D visual data, point cloud-based methods working directly with 3D representations, and hybrid modality-based methods combining multiple data streams. We systematically review representative methods along these categories, covering data representations, architectural modifications, and training strategies that bridge textual and 3D modalities. Finally, we discuss current limitations, including dataset scarcity and computational challenges, while highlighting promising research directions in spatial perception, multi-modal fusion, and real-world applications.

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Large Language Models can acquire 3D spatial reasoning capabilities through various input sources including multi-view images, RGB-D images, point clouds, and hybrid modalities, enabling the processing and understanding of three-dimensional information.
  • Figure 2: A Taxonomy of Models for Spatial Reasoning with LLMs: Image-based, Point Cloud-based, and Hybrid Modality-based Approaches and Their Subdivisions.
  • Figure 3: An overview of image-based approaches.
  • Figure 4: An overview of point cloud-based approaches.
  • Figure 5: An overview of hybrid modality-based approaches.