EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams

Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, Vladislav Golyanik

TL;DR

The paper tackles 3D human motion capture from egocentric event streams captured by a monocular event camera with a fisheye lens. It introduces EventEgo3D (EE3D), an end-to-end neural pipeline that converts high-temporal-resolution event data into 3D poses using two components: an Egocentric Pose Module (EPM) for 2D heatmap estimation and 3D lifting, and a Residual Event Propagation Module (REPM) that emphasizes wearer-related events and propagates past information. The authors build two datasets, EE3D-S (synthetic) and EE3D-R (real), for training and evaluation in this new modality, and report real-time performance at $140$ Hz with superior 3D accuracy, particularly in challenging, fast-motion scenarios. The work also presents a prototype head-mounted device and extensive ablations, showing that event-based egocentric vision can surpass RGB-based approaches under varying illumination and motion conditions, with strong potential for mobile, low-power HMD applications.

Abstract

Monocular egocentric 3D human motion capture is a challenging and actively researched problem. Existing methods use synchronously operating visual sensors (e.g. RGB cameras) and often fail under low lighting and fast motions, which can be restricting in many applications involving head-mounted devices. In response to the existing limitations, this paper 1) introduces a new problem, i.e., 3D human motion capture from an egocentric monocular event camera with a fisheye lens, and 2) proposes the first approach to it called EventEgo3D (EE3D). Event streams have high temporal resolution and provide reliable cues for 3D human motion capture under high-speed human motions and rapidly changing illumination. The proposed EE3D framework is specifically tailored for learning with event streams in the LNES representation, enabling high 3D reconstruction accuracy. We also design a prototype of a mobile head-mounted device with an event camera and record a real dataset with event observations and the ground-truth 3D human poses (in addition to the synthetic dataset). Our EE3D demonstrates robustness and superior 3D accuracy compared to existing solutions across various challenging experiments while supporting real-time 3D pose update rates of 140Hz.

Paper Structure (29 sections, 6 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Overview of our EventEgo3D approach. The HMD captures an egocentric event stream converted to a series of 2D LNES frames [rudnev2021eventhands], from which our neural architecture regresses the 3D poses of the HMD user. The residual event propagation module (REPM) emphasises events triggered around the human by considering the temporal context of observations (realised with a frame buffer with event decay based on event confidence). REPM, hence, helps the encoder-decoder (from LNES to heatmaps) and the heatmap lifting module to estimate accurate 3D human poses. The method is supervised with ground-truth segmentations, heatmaps and 3D human poses.
  • Figure 2: The frame buffer holds the previous input frame $\mathbf{\hat{L}}_{q-1}$ (a) and the previous confidence map $\mathbf{C}_{q-1}$ (b). $\mathbf{\hat{L}}_{q-1}$ is weighted with $\mathbf{C}_{q-1}$ and added to the current LNES frame $\mathbf{L}_q$ (c) to produce $\mathbf{\hat{L}}_{q}$ (d). The events generated by the subject are highlighted more strongly than the background events, so the network prioritises events generated by the subject.
  • Figure 3: Sample from EE3D-S with synthetic RGB image (left), generated event stream (middle), and human body mask (right).
  • Figure 4: Sample from EE3D-R with motion tracking setup (left) used for obtaining the ground-truth 3D poses, event stream (middle), and human body mask (right).
  • Figure 5: Qualitative results of our method in comparison to Xu et al. [xu2019mo2cap2] and Rudnev et al. [rudnev2021eventhands]. Note how the previous methods fail to estimate accurate 3D poses when events generated by the background become more prevalent than events around the human. The predictions are in red and the ground truth is in green.
  • ...and 5 more figures
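
The frame-buffer update described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It assumes the update is a plain per-pixel weighted sum, $\mathbf{\hat{L}}_{q} = \mathbf{L}_q + \mathbf{C}_{q-1} \odot \mathbf{\hat{L}}_{q-1}$, which is only what the caption states; any normalisation, clamping, or learned components in the actual REPM are not covered here. The function name `propagate_events`, the two-channel frame layout, and the toy values are hypothetical and chosen for illustration.

```python
import numpy as np


def propagate_events(current_lnes, prev_frame, prev_confidence):
    """Blend the previous buffered frame into the current LNES frame.

    current_lnes:    L_q, array of shape (H, W, C)
    prev_frame:      L-hat_{q-1}, array of shape (H, W, C)
    prev_confidence: C_{q-1}, array of shape (H, W), assumed to lie in [0, 1]

    Assumption: a plain weighted sum, as described in the Figure 2 caption.
    """
    # Broadcast the per-pixel confidence over the channel axis.
    weighted_past = prev_frame * prev_confidence[..., None]
    return current_lnes + weighted_past


# Toy 4x4 example with two channels (the channel layout is an assumption).
H, W = 4, 4
current = np.zeros((H, W, 2), dtype=np.float32)
current[1, 1, 0] = 1.0            # new event on the wearer

previous = np.zeros((H, W, 2), dtype=np.float32)
previous[1, 1, 0] = 0.5           # past event on the wearer
previous[3, 3, 1] = 0.5           # past event in the background

confidence = np.zeros((H, W), dtype=np.float32)
confidence[1, 1] = 0.9            # high confidence on the wearer only

updated = propagate_events(current, previous, confidence)
```

In this toy case the past event at the confident pixel is carried forward and reinforces the current event, while the past background event at a zero-confidence pixel is dropped, which mirrors the behaviour the caption describes.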