EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting

Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Guangming Shi, Licheng Jiao

TL;DR

EgoSplat tackles open-vocabulary egocentric scene understanding by combining language-embedded 3D Gaussian Splatting with SAM2-guided multi-view consistency and an instance-aware spatial-temporal transient predictor. The approach aggregates high-quality, cross-view features for each instance and suppresses transient artifacts to achieve robust open-vocabulary localization and segmentation. Results on ADT and HOI4D show state-of-the-art localization and segmentation performance, including an 8.2% gain in localization accuracy and a 3.7% gain in segmentation mIoU on ADT, and the method remains robust under occlusions and dynamic interactions. This work advances open-vocabulary 3D scene understanding in egocentric settings and enables natural-language querying of, and interaction with, dynamic environments.

Abstract

Egocentric scenes exhibit more frequent occlusions, more varied viewpoints, and more dynamic interactions than typical scene understanding settings. Occlusions and varied viewpoints can lead to multi-view semantic inconsistencies, while dynamic objects may act as transient distractors, introducing artifacts into semantic feature modeling. To address these challenges, we propose EgoSplat, a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding. A multi-view consistent instance feature aggregation method is designed to leverage the segmentation and tracking capabilities of SAM2 to selectively aggregate complementary features across views for each instance, ensuring precise semantic representation of scenes. Additionally, an instance-aware spatial-temporal transient prediction module is constructed to improve spatial integrity and temporal continuity in predictions by incorporating spatial-temporal associations across multi-view instances, effectively reducing artifacts in the semantic reconstruction of egocentric scenes. EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets, outperforming existing methods with an 8.2% improvement in localization accuracy and a 3.7% improvement in segmentation mIoU on the ADT dataset, and setting a new benchmark in open-vocabulary egocentric scene understanding. The code will be made publicly available.
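
As a rough illustration of what "selectively aggregating complementary features across views for each instance" could look like, the sketch below keeps the highest-scoring views of one instance and averages their unit-normalized features. This is not the authors' implementation: the function names, the per-view quality score (for example, a mask-area or visibility proxy), and the `top_k` cutoff are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def aggregate_instance_feature(view_features, view_scores, top_k=2):
    """Hypothetical aggregation for ONE tracked instance.

    view_features: one language feature vector per view of the instance.
    view_scores:   one non-negative quality score per view (an assumption;
                   the abstract does not say how views are scored).
    Keeps the top_k best-scoring views and returns their score-weighted,
    unit-normalized mean feature.
    """
    if not view_features or len(view_features) != len(view_scores):
        raise ValueError("need one score per view and at least one view")

    # Rank views from best to worst score and keep the best top_k.
    ranked = sorted(range(len(view_scores)),
                    key=lambda i: view_scores[i], reverse=True)
    chosen = ranked[:top_k]

    total = sum(view_scores[i] for i in chosen)
    dim = len(view_features[0])
    mean = [0.0] * dim
    for i in chosen:
        # Fall back to uniform weights if every chosen score is zero.
        weight = view_scores[i] / total if total > 0 else 1.0 / len(chosen)
        unit = l2_normalize(view_features[i])
        for d in range(dim):
            mean[d] += weight * unit[d]
    return l2_normalize(mean)


# Three views of one instance; the middle view is heavily occluded
# (low score), so it is excluded when top_k=2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = [0.9, 0.1, 0.9]
instance_feature = aggregate_instance_feature(features, scores, top_k=2)
```

Assigning the same aggregated feature to every view of an instance is one simple way to obtain multi-view consistent supervision; the selection criterion actually used in the paper may differ.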

Paper Structure

This paper contains 22 sections, 15 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Pipeline of EgoSplat. Given a sequence of egocentric video frames, (a) SAM2 performs video segmentation to obtain associated segments for each instance across frames. (b) A multi-view consistent instance feature aggregation module that selects high-quality views is employed to extract precise language features. (c) For dynamic objects, an instance-aware spatial-temporal transient prediction module is designed to achieve transient prediction with temporal continuity and spatial completeness. We then train 3D Gaussians using consistent instance features with dynamic objects filtered out. (d) During querying, the similarity between rendered language embeddings and text embeddings enables open-vocabulary localization and segmentation through natural language interaction.
  • Figure 2: Instance-aware spatial-temporal transient prediction. (a) RGB frame. (b) Initial transient map generated by a 2D transient prediction network. (c) Rendered features using the initial transient map, where incomplete masks introduce artifacts. (d) Video segments from SAM2. (e) Refined transient map using video segments, resulting in masks with improved edge definition. (f) Rendered features using the refined transient map.
  • Figure 3: Typical frames from Scene 2 in the ADT dataset. Egocentric scenes are characterized by a narrow field of view, varying camera coverage, and frequent human-object interactions.
  • Figure 4: Comparison of the visualization of open-vocabulary localization on the ADT dataset. We selected "tv stand" and "coaster" for visualization. The red points are the model predictions and the black dashed bounding boxes denote the annotations. It can be observed that our algorithm achieves the most accurate localization with the clearest boundaries.
  • Figure 5: Comparison of the visualization of open-vocabulary segmentation on the ADT dataset. We selected several categories from two scenes for demonstration. It can be observed that our algorithm achieves the most accurate segmentation masks.
  • ...and 6 more figures
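
The querying step described in the Figure 1 caption, where rendered language embeddings are compared with a text embedding, can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's procedure: it uses plain cosine similarity, takes the highest-scoring pixel as the localization result, and thresholds the similarity map to get a segmentation mask. The function names, the toy 2-D embeddings, and the 0.5 threshold are hypothetical; the actual relevancy score and thresholding in EgoSplat may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_map(rendered, text_embedding):
    """Per-pixel similarity between rendered embeddings and the text query.

    rendered: H x W grid (nested lists) of embedding vectors.
    """
    return [[cosine_similarity(pixel, text_embedding) for pixel in row]
            for row in rendered]


def localize(relevancy):
    """Return (row, col) of the most relevant pixel (first one on ties)."""
    best, best_score = (0, 0), float("-inf")
    for r, row in enumerate(relevancy):
        for c, score in enumerate(row):
            if score > best_score:
                best, best_score = (r, c), score
    return best


def segment(relevancy, threshold=0.5):
    """Binary mask of pixels whose relevancy reaches the threshold."""
    return [[score >= threshold for score in row] for row in relevancy]


# Toy 2x2 "rendered" embedding image and a toy text embedding.
rendered = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.0, 1.0]]]
query = [0.0, 1.0]

scores = relevancy_map(rendered, query)
point = localize(scores)
mask = segment(scores, threshold=0.5)
```

In a real pipeline the text embedding would come from a vision-language text encoder and the rendered grid from the trained 3D Gaussians; both are replaced by hand-written vectors here so the example runs on its own.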