EventEgo3D++: 3D Human Motion Capture from a Head-Mounted Event Camera

Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Alain Pagani, Didier Stricker, Christian Theobalt, Vladislav Golyanik

TL;DR

EventEgo3D++ tackles egocentric 3D human motion capture with a single head-mounted event camera, addressing the weaknesses of RGB cameras in low light and under fast motion. It introduces a two-branch architecture that combines an Egocentric Pose Module with a Residual Event Propagation Module, using LNES frames and bone-aware, multi-loss supervision to produce accurate 3D poses at 140 Hz. The work provides three new datasets (EE3D-R, EE3D-W, EE3D-S) plus allocentric RGB/SMPL annotations, and demonstrates state-of-the-art accuracy across synthetic and real-world scenarios, including in-the-wild conditions, while maintaining real-time efficiency. These contributions advance robust, real-time egocentric vision for VR/AR and motion analysis, and the released datasets are intended to support further research in event-based 3D perception.

Abstract

Monocular egocentric 3D human motion capture remains a significant challenge, particularly under low lighting and fast movements, which are common in head-mounted device applications. Existing methods that rely on RGB cameras often fail under these conditions. To address these limitations, we introduce EventEgo3D++, the first approach that leverages a monocular event camera with a fisheye lens for 3D human motion capture. Event cameras excel in high-speed scenarios and varying illumination due to their high temporal resolution, providing reliable cues for accurate 3D human motion capture. EventEgo3D++ leverages the LNES representation of event streams to enable precise 3D reconstructions. We have also developed a mobile head-mounted device (HMD) prototype equipped with an event camera and used it to capture a comprehensive dataset that includes real event observations from both controlled studio environments and in-the-wild settings, in addition to a synthetic dataset. To provide a more holistic dataset, we also include allocentric RGB streams that offer different perspectives of the HMD wearer, along with their corresponding SMPL body model. Our experiments demonstrate that EventEgo3D++ achieves superior 3D accuracy and robustness compared to existing solutions, even in challenging conditions. Moreover, our method supports real-time 3D pose updates at a rate of 140 Hz. This work is an extension of the EventEgo3D approach (CVPR 2024) and further advances the state of the art in egocentric 3D human motion capture. For more details, visit the project page at https://eventego3d.mpi-inf.mpg.de.
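
As an illustration of the LNES representation mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the commonly described form of LNES: a two-channel frame (one channel per polarity) in which each pixel stores the timestamp of its most recent event, normalised by the window length. The function name `events_to_lnes`, the event tuple layout `(t, x, y, polarity)`, and the parameters `t_start` and `window` are hypothetical and chosen only for this example.

```python
import numpy as np

def events_to_lnes(events, height, width, t_start, window):
    """Accumulate events into a 2-channel LNES-style frame (illustrative sketch).

    events: iterable of (t, x, y, polarity) tuples with polarity in {0, 1},
            assumed to be sorted by timestamp t.
    Returns an array of shape (2, height, width) with values in [0, 1).
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    for t, x, y, p in events:
        # Ignore events that fall outside the current time window.
        if not (t_start <= t < t_start + window):
            continue
        # Store the normalised timestamp; a later event at the same
        # pixel and polarity overwrites the earlier one.
        frame[int(p), int(y), int(x)] = (t - t_start) / window
    return frame

events = [(0.0, 1, 1, 0), (0.25, 2, 0, 1), (0.5, 1, 1, 0), (2.0, 0, 0, 1)]
lnes = events_to_lnes(events, height=2, width=3, t_start=0.0, window=1.0)
```

In this sketch, recent events receive values closer to 1, so a single frame retains coarse timing information within the window. An event exactly at `t_start` maps to 0 and is indistinguishable from an empty pixel, which is a limitation of this simplified version.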

Paper Structure

This paper contains 39 sections, 16 equations, 20 figures, and 9 tables.

Figures (20)

  • Figure 1: EventEgo3D++ builds upon the work of EventEgo3D (Millerdurai et al., 2024) for real-time 3D human motion capture from egocentric event streams: (a) A photograph of our new head-mounted device (HMD) with a custom-designed egocentric fisheye event camera (top) and visualisations of our synthetically rendered dataset and a real dataset recorded with the HMD (bottom); (b) Real-time demo achieving a pose update rate of $140$ Hz; (c) Visualisation of real event streams (top) and the corresponding 3D human poses from a third-person perspective.
  • Figure 2: Overview of our EventEgo3D++ approach. The HMD captures an egocentric event stream, which is then converted to a series of 2D LNES frames (Rudnev et al., 2021) as inputs to our neural architecture to estimate the 3D poses of the HMD user. The residual event propagation module (REPM) emphasises events triggered around the human by considering the temporal context of observations (realised with a frame buffer with event decay based on event confidence). REPM, hence, helps the encoder-decoder (from LNES to heatmaps) and the heatmap lifting module to estimate accurate 3D human poses. The method is supervised with ground-truth human body masks, heatmaps and 3D human poses.
  • Figure 3: The network architecture of EventEgo3D++. The Encoder takes the current LNES frame $\mathbf{\hat{L}}_{q}$ as an input. The Heatmap Decoder predicts 2D heatmaps for 16 body joints, which are then fed into the HM-to-3D lifting block to regress 3D joint locations. The Segmentation Decoder generates the human body mask, and the Confidence Decoder subsequently produces a feature map that acts on the human body mask to create a confidence map, highlighting important regions in the egocentric view.
  • Figure 4: Visualisation of frame buffering and human-weighted event generation. The frame buffer holds the previous input frame $\mathbf{\hat{L}}_{q-1}$ (a) and the previous confidence map $\mathbf{C}_{q-1}$ (b). $\mathbf{\hat{L}}_{q-1}$ is weighted with $\mathbf{C}_{q-1}$ and added to the current LNES frame $\mathbf{L}_q$ (c) to produce $\mathbf{\hat{L}}_{q}$ (d). We can observe that the events generated by the subject are highlighted more than the background events.
  • Figure 5: Our real-world setup. The head-mounted device is equipped with an event camera and a fisheye lens.
  • ...and 15 more figures
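
The frame-buffer update described in the Figure 4 caption can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that "weighted with" means an element-wise product, i.e. $\mathbf{\hat{L}}_{q} = \mathbf{L}_q + \mathbf{C}_{q-1} \odot \mathbf{\hat{L}}_{q-1}$, that the confidence map has one value per pixel shared across both polarity channels, and that the first frame passes through unchanged because the buffer is empty. The function name `propagate_events` is hypothetical, and in the actual method the confidence map is predicted by the network rather than supplied by hand.

```python
import numpy as np

def propagate_events(current_lnes, prev_input, prev_confidence):
    """Blend the buffered frame into the current LNES frame (illustrative sketch).

    current_lnes:    L_q, shape (2, H, W)
    prev_input:      L_hat_{q-1}, shape (2, H, W), or None for the first frame
    prev_confidence: C_{q-1}, shape (H, W), or None for the first frame
    Returns L_hat_q = L_q + C_{q-1} * L_hat_{q-1} (element-wise product assumed).
    """
    if prev_input is None or prev_confidence is None:
        # Empty buffer: nothing to propagate yet.
        return current_lnes.copy()
    # Broadcast the per-pixel confidence over both polarity channels.
    return current_lnes + prev_confidence[None, :, :] * prev_input

current = np.full((2, 2, 2), 0.5)
previous = np.ones((2, 2, 2))
confidence = np.array([[1.0, 0.0],
                       [0.5, 0.0]])
blended = propagate_events(current, previous, confidence)
```

Pixels where the previous confidence is high retain more of the buffered events, while low-confidence (background) pixels keep only the current events, which matches the caption's observation that subject events are emphasised over background events.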