SpatialLM: Training Large Language Models for Structured Indoor Modeling

Yongsen Mao, Junhao Zhong, Chuan Fang, Jia Zheng, Rui Tang, Hao Zhu, Ping Tan, Zihan Zhou

TL;DR

SpatialLM tackles structured indoor modeling by translating point-cloud inputs into textual scripts describing walls, doors, windows, and 3D object boxes using an Encoder-MLP-LLM pipeline. It introduces a large synthetic SpatialLM dataset to study multimodal alignment and evaluates on layout estimation and 3D object detection, achieving state-of-the-art results for layout on Structured3D and competitive results on ScanNet, with notable zero-shot video robustness. The work demonstrates a viable path for leveraging LLMs to reason about 3D scenes and enables potential applications in AR and embodied robotics. It also highlights avenues for future work, including open-vocabulary SUS and VQA tasks to broaden scene understanding capabilities.

Abstract

SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
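To make the idea of a scene expressed "in pure text form" concrete, the following is a minimal sketch of how walls and oriented object boxes could be written out as a script and read back. It is an illustration under stated assumptions, not the paper's actual output format: the class names `Wall` and `Box`, their fields, the `wall_0=Wall(...)` line syntax, and the helpers `to_script` and `parse_script` are all hypothetical. Doors and windows are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Wall:
    # Hypothetical fields: two floor-plane endpoints plus a height.
    ax: float
    ay: float
    bx: float
    by: float
    height: float


@dataclass
class Box:
    # Hypothetical oriented box: category, centre, yaw about the vertical axis, size.
    category: str
    cx: float
    cy: float
    cz: float
    yaw: float
    sx: float
    sy: float
    sz: float


def to_script(walls, boxes):
    """Serialize scene elements into one line of text per element."""
    lines = []
    for i, w in enumerate(walls):
        lines.append(f"wall_{i}=Wall({w.ax},{w.ay},{w.bx},{w.by},{w.height})")
    for i, b in enumerate(boxes):
        lines.append(
            f"bbox_{i}=Bbox({b.category},{b.cx},{b.cy},{b.cz},{b.yaw},{b.sx},{b.sy},{b.sz})"
        )
    return "\n".join(lines)


def parse_script(text):
    """Parse text produced by to_script back into Wall and Box objects."""
    walls, boxes = [], []
    for line in text.splitlines():
        _, call = line.split("=", 1)
        kind, args = call.rstrip(")").split("(", 1)
        fields = args.split(",")
        if kind == "Wall":
            walls.append(Wall(*map(float, fields)))
        elif kind == "Bbox":
            boxes.append(Box(fields[0], *map(float, fields[1:])))
    return walls, boxes
```

A round trip such as `parse_script(to_script(walls, boxes))` returns the original objects, which is the property that lets a text-generating model's output be turned back into 3D structure for visualization or evaluation.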

Paper Structure

This paper contains 22 sections, 1 equation, 15 figures, 12 tables.

Figures (15)

  • Figure 1: The overall pipeline of SpatialLM. Given the point cloud input, it employs a standard "Encoder-MLP-LLM" architecture for multimodal feature alignment (left), and generates structured scene descriptions in pure text form as output (middle). The reconstructed 3D structure is further overlaid on the point cloud for visualization (right).
  • Figure 2: Definition of our structured representation for layouts and objects.
  • Figure 3: Dataset visual quality comparison. The layout and object placements in ProcTHOR and ASE (SceneScript) are program-generated and exhibit noticeable differences from real-world statistics. The scenes in HSSD and our dataset are fully human-authored, but HSSD has only 211 scenes.
  • Figure 4: Qualitative results on Structured3D dataset.
  • Figure 5: Qualitative results on ScanNet dataset.
  • ...and 10 more figures