EgoGen: An Egocentric Synthetic Data Generator
Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang
TL;DR
EgoGen addresses the scarcity of egocentric training data with a scalable synthetic data generator that renders realistic first-person views with rich ground-truth annotations. Its core is an egocentric perception-driven motion synthesis framework that combines collision-avoiding motion primitives (CAMPs) with two-stage reinforcement learning. Because the virtual human's perception and motion are coupled in a closed loop, it can navigate dynamic, obstacle-rich environments without a pre-defined global path. The full data-generation pipeline covers camera rigs, clothing, rendering, and annotations, and is evaluated on three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. The authors report that training with EgoGen data improves state-of-the-art methods on these tasks and can augment real-world datasets, which is relevant to AR/VR and robotics applications. EgoGen will be released as open source.
Abstract
Understanding the world from a first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. Synthetic data has empowered third-person-view vision models, but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge, we introduce EgoGen, a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach, our motion synthesis model offers a closed-loop solution in which the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works, our model eliminates the need for a pre-defined global path and is directly applicable to dynamic environments. Using our easy-to-use and scalable data generation pipeline, we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. EgoGen will be fully open-sourced, offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research. Project page: https://ego-gen.github.io/.
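To make the closed-loop idea concrete, the following is a minimal 2D toy sketch, not EgoGen's model. It assumes a point agent, circular obstacles, and a fan of range rays standing in for egocentric visual input. A greedy hand-written rule replaces the learned reinforcement learning policy, and a fixed-length step along a sensed-free direction stands in for a collision-avoiding motion primitive. All names (`sense`, `choose_primitive`, `rollout`) and parameters are hypothetical. What it illustrates is only the structure: at every step the agent senses from its current pose and then moves, so no global path is planned in advance.

```python
import math


def sense(position, heading, obstacles, num_rays=9, fov=math.pi,
          max_range=3.0, step=0.05):
    """Cast rays across the agent's field of view.

    Returns (angle_offset, free_distance) pairs relative to the heading.
    Obstacles are (x, y, radius) circles; this is a stand-in for egocentric
    visual sensing, not a rendering of real depth images.
    """
    readings = []
    num_samples = round(max_range / step)
    for i in range(num_rays):
        offset = -fov / 2 + fov * i / (num_rays - 1)
        angle = heading + offset
        free_distance = max_range
        for k in range(1, num_samples + 1):
            d = k * step
            x = position[0] + d * math.cos(angle)
            y = position[1] + d * math.sin(angle)
            if any(math.hypot(x - ox, y - oy) <= r for ox, oy, r in obstacles):
                free_distance = d
                break
        readings.append((offset, free_distance))
    return readings


def choose_primitive(readings, position, heading, goal,
                     step_len=0.3, clearance=0.5):
    """Greedy stand-in for a learned policy.

    Among ray directions with enough free space, pick the one that points
    most directly at the goal. If none is free, turn in place.
    """
    goal_angle = math.atan2(goal[1] - position[1], goal[0] - position[0])
    best = None
    for offset, free_distance in readings:
        if free_distance < step_len + clearance:
            continue  # not enough free space along this ray
        new_heading = heading + offset
        diff = goal_angle - new_heading
        error = abs(math.atan2(math.sin(diff), math.cos(diff)))
        if best is None or error < best[0]:
            best = (error, new_heading)
    if best is None:
        return heading + math.pi / 2, 0.0
    return best[1], step_len


def rollout(start, goal, obstacles, max_steps=200, tol=0.3):
    """Closed loop: sense from the current pose, pick a step, move, repeat."""
    position = start
    heading = math.atan2(goal[1] - start[1], goal[0] - start[0])
    trajectory = [position]
    for _ in range(max_steps):
        if math.hypot(goal[0] - position[0], goal[1] - position[1]) <= tol:
            break
        readings = sense(position, heading, obstacles)
        heading, step_len = choose_primitive(readings, position, heading, goal)
        position = (position[0] + step_len * math.cos(heading),
                    position[1] + step_len * math.sin(heading))
        trajectory.append(position)
    return trajectory
```

Calling `rollout((0.0, 0.0), (5.0, 0.0), [(2.5, 0.0, 0.5)])` returns the list of visited positions. Because sensing is repeated at every step, moving an obstacle between steps would change the next decision without any replanning, which is the property the abstract attributes to the closed-loop design. The greedy rule gives no guarantee of reaching the goal in cluttered scenes; it only guarantees that each step is taken along a direction sensed as free.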
