What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models

Tianchen Deng, Yue Pan, Shenghai Yuan, Dong Li, Chen Wang, Mingrui Li, Long Chen, Lihua Xie, Danwei Wang, Jingchuan Wang, Javier Civera, Hesheng Wang, Weidong Chen

TL;DR

The paper surveys classical and neural 3D scene representations for robotics, comparing point clouds, voxels, SDFs, scene graphs, NeRF, 3D Gaussian Splatting, and tokenizer-based foundation models across perception, mapping, localization, manipulation, and navigation. It analyzes trade-offs in memory efficiency, photorealism, and geometric fidelity, and discusses how foundation models could unify disparate representations into end-to-end robotic systems. Key contributions include a comprehensive taxonomy, a trend analysis of neural representations, and an open-source GitHub resource, still being extended, that compiles relevant works. The work highlights the shift toward open-vocabulary, language-grounded, and token-based scene understanding while identifying data, real-time, and scalability challenges that must be addressed for practical deployment.

Abstract

In this paper, we provide a comprehensive overview of existing scene representation methods for robotics, covering traditional representations such as point clouds, voxels, signed distance functions (SDF), and scene graphs, as well as more recent neural representations such as Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and emerging foundation models. While current SLAM and localization systems predominantly rely on sparse representations like point clouds and voxels, dense scene representations are expected to play a critical role in downstream tasks such as navigation and obstacle avoidance. Moreover, neural representations such as NeRF, 3DGS, and foundation models are well-suited for integrating high-level semantic features and language-based priors, enabling more comprehensive 3D scene understanding and embodied intelligence. We categorize the core modules of robotics into five parts (Perception, Mapping, Localization, Navigation, Manipulation). We start by presenting the standard formulation of different scene representation methods and comparing the advantages and disadvantages of each representation across these modules. This survey is centered around the question: What is the best 3D scene representation for robotics? We then discuss the future development trends of 3D scene representations, with a particular focus on how the 3D Foundation Model could replace current methods as the unified solution for future robotic applications. The remaining challenges in fully realizing this model are also explored. We aim to offer a valuable resource for both newcomers and experienced researchers to explore the future of 3D scene representations and their application in robotics. We have published an open-source project on GitHub and will continue to add new works and technologies to this project.

Paper Structure

This paper contains 21 sections, 11 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: We summarize the development timeline of 3D scene representations in robotics, including point clouds, voxels, meshes, surfels, scene graphs, signed distance fields (SDF), and the most recent representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Furthermore, we categorize these representations based on their applications across different robotic modules, including mapping, SLAM, localization, planning, manipulation, and simulation. This categorization provides a comprehensive perspective on how different scene representations serve the diverse functional requirements of modern robotic systems. The images of the subfigures are sourced from octomap, lin2020sdf, deepsdf, chang2023hydra, wang2025cvpr-vggt.
  • Figure 2: We analyze the trend in the number of papers on neural scene representations in robotics indexed by Web of Science, including NeRF, 3DGS, and foundation models. Over time, the focus has gradually shifted from NeRF to 3DGS and ultimately toward foundation models.
  • Figure 3: The structure and application of the 3D scene representation for Robotics: 3D Scene Representation ($\S$\ref{sec:general}), Perception ($\S$\ref{sec:perception}), Mapping and Localization module ($\S$\ref{sec:mapping}), and Interaction module ($\S$\ref{sec:interaction}). Subplots (a-h) are extracted from grounded_3d_llm_2024, octreegs, fast-livo, humanoid, srt.
  • Figure 4: We compare various scene representations across several dimensions, including data form, continuity, memory usage, photorealism, flexibility, and geometric representation capability. Exact values differ, but the overall trend is indicative and remains comparable across methods.
  • Figure 5: A taxonomy and future work of neural scene representation: NeRF (NeRF), 3DGS (3dgs), and volumetric ellipsoid rendering (ever).
  • ...and 2 more figures