EFM3D: A Benchmark for Measuring Progress Towards 3D Egocentric Foundation Models

Julian Straub; Daniel DeTone; Tianwei Shen; Nan Yang; Chris Sweeney; Richard Newcombe

TL;DR

This work defines 3D Egocentric Foundation Models (EFMs) and introduces the EFM3D benchmark to measure progress on two core 3D tasks, 3D object detection and surface regression, using high-quality egocentric Project Aria data. It proposes Egocentric Voxel Lifting (EVL), a universal 3D backbone that lifts frozen 2D foundation features and semi-dense points into a gravity-aligned 3D voxel grid, which a 3D U-Net then processes to enable robust 3D reasoning over sequences. The authors release synthetic ASE annotations and real-world AEO/ADT data, and demonstrate that EVL trained on ASE generalizes to real data and outperforms existing 3D scene understanding methods. They also introduce mechanisms for persisting predictions over time: ObbTracker for 3D boxes and surface fusion for reconstruction. These results establish a benchmark and a strong baseline for 3D EFMs, highlight the potential of egocentric 3D priors, and point toward future work in dynamic scene understanding and user interaction modeling.
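
To make the "lifting" step concrete, the following is a minimal single-frame sketch, not the authors' implementation. It assumes a pinhole camera (Aria cameras are fisheye), nearest-pixel sampling, and a grid frame that has already been gravity-aligned through the supplied transform. The function name `lift_features_to_voxels` and all parameter names are hypothetical. How EVL aggregates features across frames and adds point masks is not specified in this summary, so the sketch omits both.

```python
import numpy as np

def lift_features_to_voxels(feat, K, T_cam_from_grid, grid_min, voxel_size, grid_shape):
    """Sample a 2D feature map at the projection of each voxel center.

    feat:            (C, H, W) 2D feature map (e.g. from a frozen 2D model).
    K:               (3, 3) pinhole intrinsics at feature-map resolution (assumption).
    T_cam_from_grid: (4, 4) rigid transform from the voxel-grid frame to the camera frame.
    grid_min:        (3,) minimum corner of the grid in the grid frame.
    voxel_size:      edge length of one voxel.
    grid_shape:      (X, Y, Z) number of voxels per axis.
    Returns a (C, X, Y, Z) feature volume and an (X, Y, Z) visibility mask.
    """
    C, H, W = feat.shape

    # Voxel centers in the grid frame, shape (X, Y, Z, 3).
    idx = np.stack(
        np.meshgrid(*[np.arange(n) for n in grid_shape], indexing="ij"), axis=-1
    )
    centers = np.asarray(grid_min, dtype=float) + (idx + 0.5) * voxel_size
    pts = centers.reshape(-1, 3)

    # Move voxel centers into the camera frame.
    pts_cam = pts @ T_cam_from_grid[:3, :3].T + T_cam_from_grid[:3, 3]
    z = pts_cam[:, 2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)  # avoid division by zero for hidden voxels

    # Pinhole projection: normalized coordinates, then intrinsics.
    uv = (pts_cam[:, :2] / z_safe[:, None]) @ K[:2, :2].T + K[:2, 2]

    # Convention (assumption): pixel (row v, col u) covers [u, u+1) x [v, v+1).
    u = np.floor(uv[:, 0]).astype(int)
    v = np.floor(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Voxels that project outside the image or lie behind the camera stay zero.
    vol = np.zeros((C, pts.shape[0]), dtype=feat.dtype)
    vol[:, valid] = feat[:, v[valid], u[valid]]
    return vol.reshape(C, *grid_shape), valid.reshape(grid_shape)
```

In a full pipeline, a volume like this would be the input to the 3D U-Net; the visibility mask indicates which voxels actually received image evidence in this frame.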

Abstract

The advent of wearable computers enables a new source of context for AI that is embedded in egocentric sensor data. This new egocentric data comes equipped with fine-grained 3D location information and thus presents the opportunity for a novel class of spatial foundation models that are rooted in 3D space. To measure progress on what we term Egocentric Foundation Models (EFMs), we establish EFM3D, a benchmark with two core 3D egocentric perception tasks. EFM3D is the first benchmark for 3D object detection and surface regression on high-quality annotated egocentric data from Project Aria. We propose Egocentric Voxel Lifting (EVL), a baseline for 3D EFMs. EVL leverages all available egocentric modalities and inherits foundational capabilities from 2D foundation models. This model, trained on a large simulated dataset, outperforms existing methods on the EFM3D benchmark.

Paper Structure (29 sections, 5 equations, 18 figures, 9 tables)

Introduction
Related Work
Dataset Contributions
Egocentric Voxel Lifting (EVL)
3D Bounding Box Detection
3D Surface Regression
Implementation and Training Details
The EFM3D Benchmark and Experiments
3D Bounding Box Detection and Persistence
3D Surface Estimation and Reconstruction
Limitations and Societal Impact
Conclusion
Acknowledgements
Model and Training Details
EVL Model Details
...and 14 more sections

Figures (18)

Figure 1: 3D Egocentric Foundation Models leverage spatial priors from egocentric data to power core 3D tasks such as 3D object detection and reconstruction.
Figure 2: EVL lifts 2D features extracted from frozen image foundation models into a local gravity-aligned 3D voxel grid of features. After concatenating point masks, a U-Net processes the 3D feature volume before running 3D CNN heads. These heads regress target parameters such as 3D OBBs and occupancy values.
Figure 3: We visualize (c) maximum linear (region in green) and (d) ScanNet linear (region in red), as well as the original fisheye (with a valid region in yellow) camera models for (a) ASE and (d) ADT.
Figure 4: ASE validation scenes overlaid with camera trajectories and the 3D OBB predictions from EVL and ImVoxelNet (Rukhovich et al., 2022).
Figure 5: Aria Everyday Object (AEO) scenes overlaid with the camera trajectories (blue) and the colored 3D bounding boxes from Ground Truth (left), EVL (middle), and ImVoxelNet (right).
...and 13 more figures
