Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, Yue Wang

TL;DR

The Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields, achieving real-time semantic 3D reconstruction for the first time.

Abstract

Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.
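The abstract's "label maps by interacting with language" can be illustrated with a toy query against a rendered feature map. The following is a minimal sketch, not the authors' implementation: it assumes each pixel of a novel view carries a feature vector in the same space as a set of text-prompt embeddings, and that a label is chosen by highest cosine similarity. The names `label_map`, `features`, `prompts`, and `embeddings`, as well as the 3-dimensional toy vectors, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def label_map(feature_map, text_embeddings):
    """Assign each pixel the index of its most similar text embedding.

    feature_map: H x W grid of D-dimensional rendered features (assumed input).
    text_embeddings: list of D-dimensional prompt embeddings (assumed input).
    """
    return [
        [max(range(len(text_embeddings)),
             key=lambda k: cosine(pixel, text_embeddings[k]))
         for pixel in row]
        for row in feature_map
    ]

# Toy 2x2 feature map with 3-dimensional features (illustrative values only).
features = [
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.9]],
]
prompts = ["wall", "floor", "chair"]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

labels = label_map(features, embeddings)
named = [[prompts[k] for k in row] for row in labels]
print(named)
```

Because the label set is just a list of prompt embeddings, changing the vocabulary in this sketch means changing `prompts` and `embeddings`; the feature map itself is untouched.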

Paper Structure

This paper contains 30 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Large Spatial Model takes two unposed images as input and reconstructs an explicit radiance field, capturing geometry, appearance, and semantics in real time. This yields high performance in versatile tasks such as view synthesis, depth prediction, and open-vocabulary 3D segmentation.
  • Figure 2: Network Architecture. A generic Transformer regresses pixel-aligned point maps from the input images. A set of semantic anisotropic 3D Gaussians incorporating geometry, appearance, and semantics is then predicted by another point-based Transformer that facilitates local context aggregation and hierarchical fusion. The model is supervised end-to-end, minimizing the loss function through comparisons against ground truth and rasterized label maps on new views. During the inference stage, our approach is capable of predicting the scene representation without requiring camera parameters, enabling real-time semantic 3D reconstruction.
  • Figure 3: Visualization of the 3D Feature Field. We present examples of features rendered from novel viewpoints, illustrating how our method converts 2D features into a consistent 3D feature field, facilitating versatile and efficient segmentation. Visualizations are generated using PCA (scikit-learn; Pedregosa et al., 2011).
  • Figure 4: Novel-View Synthesis (NVS) Comparisons. We evaluate scene-level reconstruction by comparing our method to approaches that require per-scene optimization, such as NeRF-DFF and Feature-3DGS, which predict both RGB and segmentation, and to the generalizable 3D Gaussian Splatting method pixelSplat. Notably, these methods require a pre-processing step to obtain camera poses using off-the-shelf SfM. Through end-to-end, data-driven training, our method achieves comparable visual quality to these approaches while reconstructing the 3D radiance field in a single feed-forward pass.
  • Figure 5: Language-based 3D Segmentation Comparison. We visualize the segmentation results across four unseen scenes and observe that our method performs comparably to NeRF-DFF and Feature-3DGS. This indicates that LSM effectively lifts 2D feature maps into high-quality 3D feature fields.
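The PCA visualization mentioned in the Figure 3 caption can be sketched as follows. The caption cites scikit-learn for PCA; this sketch instead uses a dependency-free power iteration with deflation as a stand-in, so it is an illustration of the idea rather than the authors' pipeline. The function names `pca_project` and `to_rgb`, the toy 4-dimensional features, and the per-channel min-max scaling to colors are all assumptions.

```python
import math

def pca_project(features, n_components=3, iters=200):
    """Project D-dim feature vectors onto their top principal components.

    Power iteration with deflation on the covariance matrix; a small
    stand-in for a library PCA, adequate for toy inputs.
    """
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - mean[j] for j in range(d)] for f in features]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]

    components = []
    for k in range(n_components):
        # Deterministic, normalized start vector.
        v = [1.0 / (i + 1 + k) for i in range(d)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in this direction
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for v.
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
        components.append(v)
        # Deflate: remove the found component from the covariance.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]

def to_rgb(projected):
    """Min-max scale each projected channel to [0, 1] for display."""
    channels = list(zip(*projected))
    lo = [min(c) for c in channels]
    hi = [max(c) for c in channels]
    return [[(p[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
             for k in range(len(p))]
            for p in projected]

# Toy per-pixel features (flattened H*W list); values are illustrative.
pixels = [
    [1.0, 0.0, 0.0, 0.2],
    [1.0, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.5, 0.5, 0.0, 0.4],
]
rgb = to_rgb(pca_project(pixels))
for color in rgb:
    print([round(c, 3) for c in color])
```

Pixels with identical features receive identical colors, which is why a 3D-consistent feature field shows stable colors for the same surface across rendered viewpoints in this kind of plot.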