EgoLifter: Open-world 3D Segmentation for Egocentric Perception

Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, Chris Sweeney

TL;DR

EgoLifter addresses open-world 3D understanding from egocentric video by jointly reconstructing a scene with 3D Gaussians and lifting 2D segmentation priors from SAM into 3D through contrastive learning. A transient prediction module filters dynamic objects during reconstruction, yielding cleaner background geometry and more cohesive object features. The approach achieves state-of-the-art open-world 2D/3D segmentation on challenging egocentric data (e.g., ADT) and enables downstream tasks like 3D object extraction and scene editing without requiring 3D annotations. By leveraging differentiable feature rendering and weak supervision from 2D masks, EgoLifter scales to diverse, dynamic environments and holds promise for AR/VR perception in naturalistic settings.

Abstract

In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in egocentric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D reconstruction. The result is a fully automatic pipeline that reconstructs 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates EgoLifter's state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We also run EgoLifter on various egocentric activity datasets, which shows the promise of the method for 3D egocentric perception at scale.
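
To make the idea of using SAM masks as weak supervision more concrete, the following is a minimal sketch of a contrastive loss over rendered per-pixel features. It is not the paper's implementation: the function name `contrastive_lifting_loss`, the cosine similarity, and the `temperature` parameter are assumptions chosen for illustration, and the paper's exact similarity function and sampling scheme may differ. Pixels that fall inside the same 2D mask are treated as positives, and all other sampled pixels as negatives.

```python
import numpy as np


def contrastive_lifting_loss(features, mask_ids, temperature=0.1):
    """Supervised contrastive loss over sampled pixels (illustrative sketch).

    features: (N, D) features rendered at N sampled pixels of one image.
    mask_ids: (N,) integer id of the 2D mask each pixel falls in.
    temperature: softmax temperature (hypothetical default).
    """
    features = np.asarray(features, dtype=np.float64)
    mask_ids = np.asarray(mask_ids)
    n = features.shape[0]
    if n < 2:
        return 0.0  # no pairs to compare

    # Cosine similarity between every pair of sampled pixels.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    logits = unit @ unit.T / temperature

    # Exclude each pixel's similarity with itself from the softmax.
    not_self = ~np.eye(n, dtype=bool)
    logits = np.where(not_self, logits, -np.inf)

    # Numerically stable log-softmax over all other pixels.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Positives: other pixels that share the same mask id.
    positives = (mask_ids[:, None] == mask_ids[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    # Average log-probability of the positives for each anchor pixel.
    sum_pos = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_pixel = -sum_pos[valid] / pos_count[valid]
    return float(per_pixel.mean())
```

Because the loss only asks whether two pixels share a mask within a single image, mask ids never need to be matched across views. Consistency across views would come from the shared 3D Gaussians that render those features.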

Paper Structure (34 sections, 4 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: EgoLifter solves 3D reconstruction and open-world segmentation simultaneously from egocentric videos. EgoLifter augments 3D Gaussian Splatting (Kerbl et al., 2023) with instance features and lifts open-world 2D segmentation by contrastive learning, where 3D Gaussians belonging to the same objects are learned to have similar features. In this way, EgoLifter solves the multi-view mask association problem and establishes a consistent 3D representation that can be decomposed into object instances. EgoLifter enables multiple downstream applications including detection, segmentation, 3D object extraction and scene editing. See https://egolifter.github.io/ for animated visualizations.
  • Figure 2: Naive 3D reconstruction from egocentric videos creates a lot of "floaters" in the reconstruction and leads to blurry rendered images and erroneous instance features (bottom right). EgoLifter tackles this problem using a transient prediction network, which predicts a probability mask of transient objects in the image and guides the reconstruction process. In this way, EgoLifter gets a much cleaner reconstruction of the static background in both RGB and feature space (top right), which in turn leads to better object decomposition of 3D scenes.
  • Figure 3: RGB images and feature maps (colored by PCA) rendered by the EgoLifter Static baseline and EgoLifter. The predicted transient maps (Trans. map) from EgoLifter are also visualized, with red color indicating a high probability of being transient. Note that the baseline puts ghostly floaters on the region of transient objects, but EgoLifter filters them out and gives a cleaner reconstruction of both RGB images and feature maps. Rows 1-3 are from ADT, rows 4-5 from AEA, and rows 6-7 from Ego-Exo4D.
  • Figure 4: Rendered images and feature maps (visualised in PCA colors) by Gaussian Grouping (Ye et al., 2023) and EgoLifter (Ours).
  • Figure 5: Individual 3D objects can be extracted by querying or clustering over the 3D features from EgoLifter. Note that object reconstructions are not perfect, since each object may be only partially observable in the egocentric videos rather than scanned intentionally.
  • ...and 4 more figures
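
The caption of Figure 2 describes a transient probability mask that guides reconstruction. One plausible way such a mask could enter the training objective is sketched below. This is an assumption for illustration only, not the loss confirmed by the source: the function name `transient_weighted_loss`, the L1 photometric error, and the `reg_weight` penalty are all hypothetical. Each pixel's reconstruction error is down-weighted by its predicted transient probability, and a small penalty discourages the trivial solution of marking every pixel as transient.

```python
import numpy as np


def transient_weighted_loss(rendered, target, transient_prob, reg_weight=0.01):
    """Photometric loss down-weighted on likely-transient pixels (sketch).

    rendered, target: (H, W, 3) images with values in [0, 1].
    transient_prob: (H, W) predicted probability that a pixel is transient.
    reg_weight: strength of the penalty on the mask (hypothetical value).
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    transient_prob = np.asarray(transient_prob, dtype=np.float64)

    # Per-pixel L1 error, averaged over the colour channels.
    per_pixel_error = np.abs(rendered - target).mean(axis=-1)

    # Pixels believed to be static keep their full weight.
    static_weight = 1.0 - transient_prob
    reconstruction = (static_weight * per_pixel_error).mean()

    # Penalise large masks so the network cannot ignore the whole image.
    mask_penalty = reg_weight * transient_prob.mean()
    return float(reconstruction + mask_penalty)
```

Under this sketch, a moving person or object that the static Gaussians cannot explain yields a high error, so raising its transient probability lowers the loss. The static background stays fully supervised, which is consistent with the cleaner reconstruction the caption reports.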