EgoSim: An Egocentric Multi-view Simulator and Real Dataset for Body-worn Cameras during Motion and Activity

Dominik Hollidt, Paul Streli, Jiaxi Jiang, Yasaman Haghighi, Changlin Qian, Xintong Liu, Christian Holz

TL;DR

Egocentric vision research has mostly focused on head-mounted cameras; this work instead targets body-worn cameras distributed across the wearer's body. It introduces EgoSim, a configurable multi-view simulator built on Unreal Engine/AirSim that renders multiple modalities (RGB, depth, normals, and semantic labels) and realistic motion artifacts via spring-damper camera attachments, driven by real motion capture data. It also presents MultiEgoView, a dataset combining 119 hours of synthetic footage from six body-worn cameras with ground-truth 3D poses and 5 hours of real-world GoPro data with Xsens poses, plus accelerometer/gyroscope readings. A Vision Transformer-based baseline for end-to-end 3D ego-pose estimation demonstrates the benefit of synthetic data for sim-to-real transfer: pretraining on synthetic data and fine-tuning on real data substantially improves pose accuracy. Together, EgoSim and MultiEgoView offer a practical platform and dataset to advance learning and evaluation for body-worn, multi-view egocentric perception, with potential to extend to depth, semantics, and inertial cues.
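
To give a feel for what a spring-damper camera attachment does, here is a minimal sketch, not EgoSim's actual implementation. It assumes a single axis, a point-mass camera, and semi-implicit Euler integration. The function name `simulate_spring_damper` and all parameter values (stiffness, damping, mass, time step) are hypothetical choices made for illustration. The anchor trajectory stands in for the motion-capture position of the body part the camera is strapped to.

```python
def simulate_spring_damper(anchor, dt, stiffness, damping, mass=1.0):
    """Follow a 1D anchor trajectory with a point mass on a spring-damper.

    anchor    : list of anchor positions (e.g. a wrist coordinate from mocap)
    dt        : time step in seconds
    stiffness : spring constant k (hypothetical value, N/m)
    damping   : damping coefficient c (hypothetical value, N*s/m)
    mass      : camera mass m (hypothetical value, kg)
    """
    pos = anchor[0]   # camera starts at rest on the anchor
    vel = 0.0
    trajectory = [pos]
    for i in range(1, len(anchor)):
        # Finite-difference estimate of the anchor velocity.
        anchor_vel = (anchor[i] - anchor[i - 1]) / dt
        # Spring pulls the camera toward the anchor; the damper resists
        # relative velocity between camera and anchor.
        force = -stiffness * (pos - anchor[i]) - damping * (vel - anchor_vel)
        # Semi-implicit Euler: update velocity first, then position.
        vel += (force / mass) * dt
        pos += vel * dt
        trajectory.append(pos)
    return trajectory


if __name__ == "__main__":
    dt = 1.0 / 240.0
    # Anchor jumps by 5 cm, a crude stand-in for a sudden limb movement.
    anchor = [0.0] * 10 + [0.05] * 2000
    cam = simulate_spring_damper(anchor, dt, stiffness=400.0, damping=4.0)
    print(f"peak camera offset: {max(cam):.4f} m, final: {cam[-1]:.4f} m")
```

With these lightly damped settings, the simulated camera overshoots the anchor and oscillates before settling, which is the kind of wobble a loosely strapped wrist or knee camera would show. A stiffer spring or stronger damper would track the anchor more tightly.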

Abstract

Research on egocentric tasks in computer vision has mostly focused on head-mounted cameras, such as fisheye cameras or embedded cameras inside immersive headsets. We argue that the increasing miniaturization of optical sensors will lead to the prolific integration of cameras into many more body-worn devices at various locations. This will bring fresh perspectives to established tasks in computer vision and benefit key areas such as human motion tracking, body pose estimation, or action recognition, particularly for the lower body, which is typically occluded. In this paper, we introduce EgoSim, a novel simulator of body-worn cameras that generates realistic egocentric renderings from multiple perspectives across a wearer's body. A key feature of EgoSim is its use of real motion capture data to render motion artifacts, which are especially noticeable with arm- or leg-worn cameras. In addition, we introduce MultiEgoView, a dataset of egocentric footage from six body-worn cameras and ground-truth full-body 3D poses during several activities: 119 hours of data are derived from AMASS motion sequences in four high-fidelity virtual environments, which we augment with 5 hours of real-world motion data from 13 participants using six GoPro cameras and 3D body pose references from an Xsens motion capture suit. We demonstrate EgoSim's effectiveness by training an end-to-end video-only 3D pose estimation network. Analyzing its domain gap, we show that our dataset and simulator substantially aid training for inference on real-world data. EgoSim code & MultiEgoView dataset: https://siplab.org/projects/EgoSim

Paper Structure

This paper contains 31 sections, 8 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: (Left) Our dataset MultiEgoView contains 5 hours of egocentric real-world footage from 6 body-worn GoPro cameras and ground-truth 3D body poses from an Xsens motion capture suit as well as 119 hours of simulated footage in high-fidelity virtual environments on the basis of real motion capture data and associated 3D body poses. (Right) Our method estimates ego poses from video data alone, here visualized inside the scanned 3D scene.
  • Figure 2: EgoSim renders multiple modalities: (a) RGB, (b) depth, (c) normals, (d) semantic labels.
  • Figure 3: Example RGB renders produced by EgoSim and included in our MultiEgoView dataset. Qualitatively, the simulated scan (d, e, f) and real data (g, h, i) look similar. Both simulated scenes (Scene 1: a, b, c; Scene 2: d, e, f) offer high-fidelity environments. The pelvis provides a stable view of the environment, whereas wrist and knee cameras typically move quickly and capture artifacts.
  • Figure 4: Visualization of results from our multi-view egocentric pose estimator on real-world data. Changes in color denote different timestamps.
  • Figure 5: An excerpt of our example images from 24 locations across 4 scenes.
  • ...and 1 more figure