EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere

Jiaxi Jiang, Paul Streli, Manuel Meier, Christian Holz

TL;DR

EgoPoser establishes a robust baseline for future work where full-body pose estimation no longer needs to rely on outside-in capture and can scale to large-scale and unseen environments.

Abstract

Full-body egocentric pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representations on headset-based platforms. However, existing methods over-rely on the indoor motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous joint motion capture and uniform body dimensions. We propose EgoPoser to overcome these limitations with four main contributions. 1) EgoPoser robustly models body pose from intermittent hand position and orientation tracking, which is available only when the hands are inside a headset's field of view. 2) We rethink input representations for headset-based ego-pose estimation and introduce a novel global motion decomposition method that predicts full-body pose independent of global positions. 3) We enhance pose estimation by capturing longer motion time series through a SlowFast module design that maintains computational efficiency. 4) EgoPoser generalizes across various body shapes for different users. We experimentally evaluate our method and show that it outperforms state-of-the-art methods both qualitatively and quantitatively while maintaining a high inference speed of over 600 fps. EgoPoser establishes a robust baseline for future work where full-body pose estimation no longer needs to rely on outside-in capture and can scale to large-scale and unseen environments.
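To make the first contribution concrete, the following is a minimal sketch of how hand tracking might be gated by a headset's field of view. It is not the authors' implementation. The function names `in_fov` and `mask_hand`, the axis convention (head looks along its local +z axis), the default FoV angles, and the choice to zero out features of untracked hands are all assumptions made for illustration.

```python
import numpy as np


def in_fov(head_rot, head_pos, hand_pos, fov_h_deg=120.0, fov_v_deg=90.0):
    """Return True if a hand position lies inside the head's viewing frustum.

    Assumptions (illustrative, not from the paper):
      - head_rot is a 3x3 rotation matrix mapping head-local to world axes,
      - the head looks along its local +z axis, with +x right and +y up,
      - the FoV is a symmetric horizontal/vertical angular range.
    """
    # Express the hand position in the head's local coordinate frame.
    offset = np.asarray(hand_pos, dtype=float) - np.asarray(head_pos, dtype=float)
    x, y, z = np.asarray(head_rot, dtype=float).T @ offset
    if z <= 0.0:
        # The hand is beside or behind the head, so it cannot be in view.
        return False
    h_angle = np.degrees(np.arctan2(x, z))  # left/right angle off the view axis
    v_angle = np.degrees(np.arctan2(y, z))  # up/down angle off the view axis
    return bool(abs(h_angle) <= fov_h_deg / 2 and abs(v_angle) <= fov_v_deg / 2)


def mask_hand(hand_features, visible):
    """Keep hand features when tracked; otherwise return zeros (placeholder choice)."""
    hand_features = np.asarray(hand_features, dtype=float)
    return hand_features if visible else np.zeros_like(hand_features)


# Usage: a hand half a metre in front of the head is tracked, one behind is not.
head_rot, head_pos = np.eye(3), np.zeros(3)
front_visible = in_fov(head_rot, head_pos, [0.0, 0.0, 0.5])
behind_visible = in_fov(head_rot, head_pos, [0.0, 0.0, -0.5])
```

A model trained on inputs masked this way sees the same intermittent hand signal during training that an inside-out tracked headset provides at run time, which is the setting the abstract describes.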

Paper Structure (19 sections, 5 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Today's Mixed Reality systems integrate all tracking inside the headset, supporting mobile use in everyday environments. This sacrifices much of the user's body and surroundings for input, when body parts leave the cameras' field of view. Accounting for these constraints, our novel method EgoPoser robustly estimates full-body poses that are spatially and temporally coherent, even from the sparse and intermittent inside-out tracking input available on today's headsets.
  • Figure 2: The architecture of EgoPoser for full-body pose estimation from an MR device. Given N=80 frames as input, the model predicts the full-body pose for the last frame of the window at each timestamp, facilitating real-time applications.
  • Figure 3: An illustration of an HMD's field of view and in-FoV conditions.
  • Figure 4: An illustration of the temporal and spatial normalizations for robust position-invariant pose estimation.
  • Figure 5: SlowFast feature fusion module. Original signals are sparsely and densely sampled and then concatenated.
  • ...and 6 more figures
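
The sampling scheme in the Figure 5 caption (sparse and dense sampling followed by concatenation) can be sketched as below, using the N=80 input window from Figure 2. This is a hedged illustration rather than the paper's module: the function name `slowfast_fuse`, the stride of 2, the choice to take the most recent frames for the dense path, and concatenation along the feature axis are assumptions.

```python
import numpy as np


def slowfast_fuse(window, stride=2):
    """Fuse a sparse (slow) and a dense (fast) view of one input window.

    window: array of shape (N, D), oldest frame first.
    Returns an array of shape (N // stride, 2 * D).

    Assumptions (illustrative, not from the paper): the slow path keeps every
    `stride`-th frame across the whole window, the fast path keeps the most
    recent N // stride frames, and both are joined along the feature axis.
    """
    window = np.asarray(window)
    n = window.shape[0]
    if n % stride != 0:
        raise ValueError("window length must be divisible by stride")
    m = n // stride
    # Slow path: long temporal context at reduced rate, ending on the last frame.
    slow = window[stride - 1::stride]
    # Fast path: short temporal context at the full frame rate.
    fast = window[-m:]
    return np.concatenate([slow, fast], axis=-1)


# Usage: an 80-frame window with 3 features per frame becomes 40 fused tokens.
window = np.arange(80 * 3, dtype=float).reshape(80, 3)
fused = slowfast_fuse(window)
```

Under these assumptions the fused sequence is half as long as the raw window, so a downstream sequence model processes fewer tokens while still covering all 80 frames through the slow path.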