EgoGen: An Egocentric Synthetic Data Generator

Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang

TL;DR

EgoGen addresses the scarcity of training data for egocentric perception with a scalable synthetic data generator that renders realistic first-person views with rich ground-truth annotations. At its core is an egocentric perception-driven motion synthesis framework that combines collision-avoiding motion primitives (CAMPs) with a two-stage reinforcement learning approach, coupling perception and motion in a closed loop so that virtual humans can navigate dynamic, obstacle-rich environments without a pre-defined global path. The data-generation pipeline covers camera rigs, clothing, rendering, and annotations, and is evaluated on three tasks: mapping and localization for head-mounted devices (HMDs), egocentric camera tracking, and human mesh recovery from egocentric views. The authors plan a full open-source release. Empirical results indicate that training with EgoGen data improves state-of-the-art methods on these tasks and can augment real-world datasets, which is relevant to AR/VR and robotics applications.

Abstract

Understanding the world in first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. Synthetic data has empowered third-person-view vision models, but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge, we introduce EgoGen, a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach, our motion synthesis model offers a closed-loop solution where the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works, our model eliminates the need for a pre-defined global path, and is directly applicable to dynamic environments. Combined with our easy-to-use and scalable data generation pipeline, we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. EgoGen will be fully open-sourced, offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research. Refer to our project page: https://ego-gen.github.io/.

Paper Structure (44 sections, 17 equations, 17 figures, 10 tables, 1 algorithm)

Figures (17)

  • Figure 1: EgoGen: a scalable synthetic data generation system for egocentric perception tasks, with rich multi-modal data and accurate annotations. We simulate camera rigs for head-mounted devices (HMDs) and render from the perspective of the camera wearer with various sensors. Top to bottom: middle and right camera sensors in the rig. Left to right: photo-realistic RGB image, RGB with simulated motion blur, depth map, surface normal, segmentation mask, and world position for fisheye cameras widely used in HMDs.
  • Figure 2: Policy network architecture. We learn a generalizable mapping from motion seed body markers $\mathbf{X}_t^S$, marker directions $\mathbf{X}_t^{S^D}$, egocentric sensing $\mathcal{E}_t$, and distance to the target $d_t$ to CAMPs. The policy learns a stochastic, collision-avoiding action space to predict future body markers $\mathbf{X}_t^F$. For illustration purposes, we visualize only one frame of $\mathbf{X}_t^S$ and $\mathcal{E}_t$. See Sec. \ref{sec:3-1} and \ref{sec:3-2} for details.
  • Figure 3: Overview of EgoGen. Through generative motion synthesis (Sec. \ref{sec:3}), we further enhance egocentric data diversity by randomly sampling diverse body textures (ethnicity, gender) and 3D textured clothing through an automated clothing simulation pipeline (Sec. \ref{sec:clothing}). With high-quality scenes and different egocentric cameras, we can render photorealistic egocentric synthetic data with rich and accurate ground truth annotations (Sec. \ref{sec:render}).
  • Figure S1: The 2D projection of the egocentric camera location is represented by the purple point, while the 2D projection of the viewing direction $\vv{\mathbf{v}}$ is indicated by the red arrow. The field of view changes due to the head pose.
  • Figure S2: Failure cases.
  • ...and 12 more figures
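
The Figure 2 caption describes a policy that maps four inputs (motion seed markers, marker directions, egocentric sensing, and distance to the target) to a stochastic action. The following is only a minimal sketch of that input/output interface, not the authors' network: the single linear layer, the fixed log standard deviation, and every dimension (2 frames, 4 markers, 8 sensing values per frame, 16 action dimensions) are assumptions chosen for illustration, and all function and class names are hypothetical.

```python
import math
import random


def flatten(x):
    """Recursively flatten nested lists of numbers into one flat list of floats."""
    if isinstance(x, (int, float)):
        return [float(x)]
    out = []
    for item in x:
        out.extend(flatten(item))
    return out


def build_policy_input(seed_markers, marker_dirs, ego_sensing, dist_to_target):
    """Concatenate the four inputs named in the Figure 2 caption into one vector."""
    return (flatten(seed_markers) + flatten(marker_dirs)
            + flatten(ego_sensing) + [float(dist_to_target)])


class LinearGaussianPolicy:
    """Toy stochastic policy: a linear mean plus fixed-variance Gaussian noise.

    This stands in for the (unspecified here) policy network; it is an assumption.
    """

    def __init__(self, in_dim, action_dim, rng):
        scale = 1.0 / math.sqrt(in_dim)
        self.w_mean = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
                       for _ in range(action_dim)]
        self.log_std = [-1.0] * action_dim  # hypothetical fixed exploration noise

    def mean(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.w_mean]

    def sample(self, x, rng):
        # Stochastic action: mean plus Gaussian noise scaled by exp(log_std).
        return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
                for m, ls in zip(self.mean(x), self.log_std)]


# Hypothetical sizes: T frames, M markers per frame, R sensing values per frame.
T, M, R = 2, 4, 8
rng = random.Random(0)
seed_markers = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(M)] for _ in range(T)]
marker_dirs = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(M)] for _ in range(T)]
ego_sensing = [[rng.uniform(0, 1) for _ in range(R)] for _ in range(T)]
dist_to_target = 2.5

x = build_policy_input(seed_markers, marker_dirs, ego_sensing, dist_to_target)
policy = LinearGaussianPolicy(in_dim=len(x), action_dim=16, rng=rng)
action = policy.sample(x, rng)
```

In this sketch the sampled `action` would then be passed to a motion-primitive decoder to produce the future body markers; that decoder is outside the scope of the example.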