EgoGen: An Egocentric Synthetic Data Generator

Gen Li; Kaifeng Zhao; Siwei Zhang; Xiaozhong Lyu; Mihai Dusmanu; Yan Zhang; Marc Pollefeys; Siyu Tang

TL;DR

EgoGen addresses the scarcity of training data for egocentric perception with a scalable synthetic data generator that renders realistic first-person views with rich ground-truth annotations. At its core is an egocentric perception-driven motion synthesis framework built on collision-avoiding motion primitives (CAMPs) and a two-stage reinforcement learning approach that couples perception with motion, so virtual humans can move through dynamic, obstacle-rich environments without a pre-defined global path. The data-generation pipeline covers camera rigs, clothing, rendering, and annotations, and is to be released as open source. The authors evaluate EgoGen on three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. Empirical results show that training with EgoGen data improves state-of-the-art methods and that the synthetic data can augment real-world datasets, which is relevant to AR/VR and robotics applications.

Abstract

Understanding the world in first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. Synthetic data has empowered third-person-view vision models, but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge, we introduce EgoGen, a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach, our motion synthesis model offers a closed-loop solution where the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works, our model eliminates the need for a pre-defined global path, and is directly applicable to dynamic environments. Combined with our easy-to-use and scalable data generation pipeline, we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. EgoGen will be fully open-sourced, offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research. Refer to our project page: https://ego-gen.github.io/.
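The closed-loop idea in the abstract (sense from the agent's own viewpoint, act, then sense again, with no pre-planned global path) can be illustrated with a toy 2D example. This is only a sketch under strong assumptions and is not EgoGen's model: a hand-written rule stands in for the learned reinforcement learning policy, 2D ray casting against circular obstacles stands in for egocentric visual sensing, and every name and parameter (`ray_hit`, `navigate`, `body`, `lookahead`, and so on) is hypothetical.

```python
import math


def ray_hit(pos, angle, obstacles, max_range):
    """Distance along a ray to the nearest circular obstacle, capped at max_range."""
    dx, dy = math.cos(angle), math.sin(angle)
    nearest = max_range
    for cx, cy, radius in obstacles:
        ox, oy = cx - pos[0], cy - pos[1]
        proj = ox * dx + oy * dy                 # obstacle centre projected onto the ray
        if proj <= 0.0:
            continue                             # centre lies behind the ray origin
        perp2 = ox * ox + oy * oy - proj * proj  # squared distance from centre to the ray
        if perp2 > radius * radius:
            continue                             # ray passes beside the obstacle
        hit = proj - math.sqrt(radius * radius - perp2)
        nearest = min(nearest, max(hit, 0.0))
    return nearest


def navigate(start, goal, obstacles, body=0.3, step=0.2, lookahead=1.0,
             max_range=3.0, max_steps=500):
    """Toy closed loop: sense from the current pose, pick a local move, repeat."""
    # Inflate obstacles by the (hypothetical) body radius so the agent is a point.
    inflated = [(cx, cy, r + body) for cx, cy, r in obstacles]
    # Candidate heading offsets relative to the goal direction, smallest deviation first.
    offsets = [0.0]
    for k in range(1, 5):
        offsets += [-k * math.pi / 8, k * math.pi / 8]

    pos, path = start, [start]
    for _ in range(max_steps):
        if math.dist(pos, goal) <= step:
            return path, True
        bearing = math.atan2(goal[1] - pos[1], goal[0] - pos[0])
        # Perception: the scene is re-sensed from the agent's own pose at every step.
        sensed = [(off, ray_hit(pos, bearing + off, inflated, max_range))
                  for off in offsets]
        # Action: take the first candidate whose ray is clear for the lookahead distance.
        free = [off for off, dist in sensed if dist >= lookahead]
        if not free:
            return path, False                   # boxed in; this toy rule gives up
        heading = bearing + free[0]
        pos = (pos[0] + step * math.cos(heading), pos[1] + step * math.sin(heading))
        path.append(pos)
    return path, False


if __name__ == "__main__":
    path, reached = navigate((0.0, 0.0), (10.0, 0.0), [(5.0, 0.0, 1.0)])
    print(reached, len(path))
```

Because sensing is repeated at every step, no global route is ever computed; the path emerges from local decisions. The paper replaces the hand-written rule used here with learned stochastic policies over motion primitives.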

Paper Structure (44 sections, 17 equations, 17 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Ego-Sensing Driven Motion Synthesis
Ego-Sensing Driven Motion Primitives
Training Collision-Avoiding Stochastic Policies
Compositing Learned Motion Primitives
Egocentric Synthetic Data Generation
Embodied Camera Placement
Body Texture and Clothing
Rendering and Annotations
Experiments
Evaluation of Learned CAMPs
Evaluation of Egocentric Sensing
Ablation Studies
Mapping, Localization, and Tracking for HMD
...and 29 more sections

Figures (17)

Figure 1: EgoGen: a scalable synthetic data generation system for egocentric perception tasks, with rich multi-modal data and accurate annotations. We simulate camera rigs for head-mounted devices (HMDs) and render from the perspective of the camera wearer with various sensors. Top to bottom: middle and right camera sensors in the rig. Left to right: photo-realistic RGB image, RGB with simulated motion blur, depth map, surface normal, segmentation mask, and world position for fisheye cameras widely used in HMDs.
Figure 2: Policy network architecture. We learn a generalizable mapping from motion seed body markers $\mathbf{X}_t^S$, marker directions $\mathbf{X}_t^{S^D}$, egocentric sensing $\mathcal{E}_t$, and distance to the target $d_t$ to CAMPs. The policy learns a stochastic collision-avoiding action space to predict future body markers $\mathbf{X}_t^F$. For illustration purposes, we visualize only one frame of $\mathbf{X}_t^S$ and $\mathcal{E}_t$. See Sec. 3.1 and Sec. 3.2 for details.
Figure 3: Overview of EgoGen. Through generative motion synthesis (Sec. 3), we further enhance egocentric data diversity by randomly sampling diverse body textures (ethnicity, gender) and 3D textured clothing through an automated clothing simulation pipeline (see "Body Texture and Clothing"). With high-quality scenes and different egocentric cameras, we can render photorealistic egocentric synthetic data with rich and accurate ground truth annotations (see "Rendering and Annotations").
Figure S1: The 2D projection of the egocentric camera location is represented by the purple point, while the 2D projection of the viewing direction $\vec{\mathbf{v}}$ is indicated by the red arrow. The field of view changes due to the head pose.
Figure S2: Failure cases.
...and 12 more figures
