Event-based Egocentric Human Pose Estimation in Dynamic Environment

Wataru Ikeda, Masashi Hatano, Ryosei Hara, Mariko Isogawa

TL;DR

This work tackles egocentric 3D human pose estimation from a front-facing, head-mounted event camera in dynamic environments. It introduces D-EventEgo, a three-stage pipeline that voxelizes the event stream into a voxel grid $V \in \mathbb{R}^{T \times H \times W \times B}$ used for background extraction, estimates head poses $H \in \mathbb{R}^{T \times D'}$, and generates full-body poses $X \in \mathbb{R}^{T \times D}$ via a conditional diffusion model. Key contributions include a Motion Segmentation Module that removes dynamic objects, a synthetic event dataset derived from EgoBody, and experiments showing improvements over an RGB baseline on four of five metrics. The results indicate robustness to low-light conditions and motion blur, highlighting the potential of event cameras for practical egocentric pose estimation and suggesting future work that integrates RGB data and environmental context.

Abstract

Estimating human pose using a front-facing egocentric camera is essential for applications such as sports motion analysis, VR/AR, and AI for wearable devices. However, many existing methods rely on RGB cameras and do not account for low-light environments or motion blur. Event-based cameras have the potential to address these challenges. In this work, we introduce a novel task of human pose estimation using a front-facing event-based camera mounted on the head and propose D-EventEgo, the first framework for this task. The proposed method first estimates the head poses, which are then used as conditions to generate body poses. However, when estimating head poses, the presence of dynamic objects mixed with background events may reduce head pose estimation accuracy. Therefore, we introduce the Motion Segmentation Module to remove dynamic objects and extract background information. Extensive experiments on our synthetic event-based dataset derived from EgoBody demonstrate that our approach outperforms our baseline in four out of five evaluation metrics in dynamic environments.

Paper Structure

This paper contains 11 sections, 1 equation, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview. (a) Experimental setup (blue: camera-wearing subject, white: pedestrians not involved in pose estimation); (b) input event data; (c) estimated human 3D mesh.
  • Figure 2: Comparison of Egocentric Pose Estimation Methods. Comparison of RGB and event-based cameras and whether they assume body visibility.
  • Figure 3: Voxel Grid and Motion Segmentation. Comparison between the voxel grid from raw data and the voxel grid after motion segmentation.
  • Figure 4: Overview of D-EventEgo. The proposed model processes a sequence of egocentric event data as input and constructs a voxel grid to facilitate background extraction using the Motion Segmentation Module. Subsequently, the extracted background information is utilized by the Head Pose Estimation Module to determine the head pose. Finally, the Body Pose Estimation Module generates the full-body pose based on the estimated head pose.
  • Figure 5: Qualitative results. We show the results of three event-based data sequences from different scenes. The arrows indicate incorrect movements.
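
Figures 3 and 4 refer to a voxel grid built from the event stream, and the TL;DR gives its shape as T x H x W x B. The summary does not spell out how the grid is filled, so the following is only a minimal sketch of one common event voxelization, not the authors' implementation. It assumes events arrive as rows of (t, x, y, p) with polarity p in {-1, +1}, that the sequence is split evenly into T time steps with B temporal bins each, and that polarities are summed per cell. The function name `events_to_voxel_grid` and these conventions are hypothetical.

```python
import numpy as np


def events_to_voxel_grid(events, T, H, W, B):
    """Accumulate events into a (T, H, W, B) grid of summed polarities.

    events: (N, 4) array with columns (t, x, y, p); the column layout and the
    even split of time into T * B slots are assumptions made for this sketch.
    """
    grid = np.zeros((T, H, W, B), dtype=np.float32)
    if len(events) == 0:
        return grid

    t = events[:, 0].astype(np.float64)
    x = events[:, 1].astype(np.int64)
    y = events[:, 2].astype(np.int64)
    p = events[:, 3].astype(np.float32)

    # Normalise timestamps to [0, 1]; the guard avoids division by zero
    # when all events share one timestamp.
    span = max(t.max() - t.min(), 1e-9)
    t_norm = (t - t.min()) / span

    # Map each event to one of T * B consecutive temporal slots,
    # clamping the final timestamp into the last slot.
    slot = np.minimum((t_norm * T * B).astype(np.int64), T * B - 1)
    frame = slot // B      # index along the T axis
    time_bin = slot % B    # index along the B axis

    # np.add.at accumulates correctly when several events hit the same cell,
    # which plain fancy-index assignment would not.
    np.add.at(grid, (frame, y, x, time_bin), p)
    return grid


# Usage with three toy events on a 4 x 3 sensor.
toy_events = np.array([
    [0.0, 1, 2, 1],
    [0.0, 1, 2, 1],
    [1.0, 0, 0, -1],
])
toy_grid = events_to_voxel_grid(toy_events, T=2, H=4, W=3, B=2)
print(toy_grid.shape)
```

In a pipeline like the one in Figure 4, a grid of this kind would be the input to motion segmentation. How dynamic-object events are then masked out is specific to the paper and is not sketched here.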