Memorize What Matters: Emergent Scene Decomposition from Multitraverse

Yiming Li, Zehong Wang, Yue Wang, Zhiding Yu, Zan Gojcic, Marco Pavone, Chen Feng, Jose M. Alvarez

TL;DR

The paper tackles how to memorize persistent elements of a 3D environment while forgetting transient objects using only camera data across multiple traversals. It introduces 3D Gaussian Mapping (3DGM), a self-supervised framework that leverages robust differentiable rendering with 3D Gaussian Splatting (3DGS) to produce an environmental Gaussian map (EnvGS) and, simultaneously, 2D ephemerality segmentation (EmerSeg). The approach uses SfM for initialization, robust feature distillation, and feature residuals mining to identify transient pixels, enabling environment-object decomposition without human labels or LiDAR. A new Mapverse benchmark (from Ithaca365 and nuPlan) evaluates unsupervised 2D segmentation, 3D reconstruction, and neural rendering, showing strong performance across diverse driving scenarios. The results demonstrate effective LiDAR-free, self-supervised mapping with practical implications for autonomous driving, scene understanding, and offline autolabeling, while identifying avenues for handling extreme conditions and large-scale scenes.

Abstract

Humans naturally retain memories of permanent elements, while ephemeral moments often slip through the cracks of memory. This selective retention is crucial for robotic perception, localization, and mapping. To endow robots with this capability, we introduce 3D Gaussian Mapping (3DGM), a self-supervised, camera-only offline mapping framework grounded in 3D Gaussian Splatting. 3DGM converts multitraverse RGB videos from the same region into a Gaussian-based environmental map while concurrently performing 2D ephemeral object segmentation. Our key observation is that the environment remains consistent across traversals, while objects frequently change. This allows us to exploit self-supervision from repeated traversals to achieve environment-object decomposition. More specifically, 3DGM formulates multitraverse environmental mapping as a robust differentiable rendering problem, treating pixels of the environment and objects as inliers and outliers, respectively. Using robust feature distillation, feature residuals mining, and robust optimization, 3DGM jointly performs 2D segmentation and 3D mapping without human intervention. We build the Mapverse benchmark, sourced from the Ithaca365 and nuPlan datasets, to evaluate our method in unsupervised 2D segmentation, 3D reconstruction, and neural rendering. Extensive results verify the effectiveness and potential of our method for self-driving and robotics.
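
The inlier/outlier idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the cosine-distance residual, the fixed threshold of 0.5, and the plain L1 loss are all assumptions made for illustration. The paper describes mining spatial information from the feature residuals rather than a simple per-pixel threshold.

```python
import numpy as np

def feature_residual(rendered, target, eps=1e-8):
    """Per-pixel cosine distance between two (H, W, C) feature maps.

    `rendered` stands in for features rendered from the environment map,
    `target` for features extracted from the observed image (assumed inputs).
    """
    num = np.sum(rendered * target, axis=-1)
    den = np.linalg.norm(rendered, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return 1.0 - num / den

def ephemerality_mask(residual, threshold=0.5):
    """Mark pixels with a large residual as outliers (likely transient objects).

    The fixed threshold is a hypothetical simplification.
    """
    return residual > threshold

def masked_l1_loss(rendered_rgb, target_rgb, outlier_mask):
    """Mean L1 photometric error over inlier (environment) pixels only."""
    inliers = ~outlier_mask
    if not inliers.any():
        return 0.0
    per_pixel = np.abs(rendered_rgb - target_rgb).mean(axis=-1)
    return float(per_pixel[inliers].mean())
```

In this sketch, pixels where the rendered environment features disagree with the observed features are excluded from the photometric loss, so transient objects do not pull the environment map toward themselves.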

Paper Structure (86 sections, 5 equations, 34 figures, 5 tables, 1 algorithm)

Figures (34)

  • Figure 1: A high-level diagram of 3D Gaussian Mapping (3DGM). Given multitraverse RGB videos, 3DGM outputs a Gaussian-based environment map (EnvGS) and 2D ephemerality segmentation (EmerSeg) for the input images. Note that the proposed framework is LiDAR-free and self-supervised.
  • Figure 1: Mean IoU of our unsupervised method vs. five supervised methods in Mapverse-Ithaca365. $^*$ indicates a model that was not trained on our dataset.
  • Figure 2: An overall illustration of 3DGM. Given RGB camera observations collected at different times, we use COLMAP to obtain the camera poses and initial Gaussian points. Then we utilize splatting-based rasterization to render both RGB images and robust features from the environmental Gaussians. We further leverage feature residuals to extract the object masks by mining spatial information of the residuals. Finally, we utilize the ephemerality masks to finetune the 3D Gaussians.
  • Figure 3: IoU at 20 locations in Ithaca, NY.
  • Figure 4: Qualitative evaluations of EmerSeg in Mapverse-Ithaca365.
  • ...and 29 more figures