SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras

Hanz Cuevas-Velasquez, Charlie Hewitt, Sadegh Aliakbarian, Tadas Baltrušaitis

TL;DR

This work tackles egocentric full-body pose estimation from downward-facing rectilinear HMD cameras by regressing joint rotations directly as matrix Fisher distributions over $SO(3)$, enabling uncertainty quantification for occluded or out-of-frame joints. The method avoids 2D heatmaps, uses a compact end-to-end architecture, and leverages the SMPL-H* body model to produce meshes with pose uncertainty. It introduces the SynthEgo synthetic dataset with 60K stereo pairs and 54 joints, showing state-of-the-art accuracy while being significantly faster and smaller than prior methods, and demonstrates strong generalization to real-world data. Overall, the approach provides reliable, uncertainty-aware egocentric pose estimates suitable for live avatar animation on resource-constrained devices and offers insights into learned pose priors and confidence calibration.

Abstract

Our work addresses the problem of egocentric human pose estimation from downwards-facing cameras on head-mounted devices (HMD). This presents a challenging scenario, as parts of the body often fall outside of the image or are occluded. Previous solutions minimize this problem by using fish-eye camera lenses to capture a wider view, but these can present hardware design issues. They also predict 2D heat-maps per joint and lift them to 3D space to deal with self-occlusions, but this requires large network architectures which are impractical to deploy on resource-constrained HMDs. We predict pose from images captured with conventional rectilinear camera lenses. This resolves hardware design issues, but means body parts are often out of frame. As such, we directly regress probabilistic joint rotations represented as matrix Fisher distributions for a parameterized body model. This allows us to quantify pose uncertainties and explain out-of-frame or occluded joints. This also removes the need to compute 2D heat-maps and allows for simplified DNN architectures which require less compute. Given the lack of egocentric datasets using rectilinear camera lenses, we introduce the SynthEgo dataset, a synthetic dataset with 60K stereo images containing high diversity of pose, shape, clothing and skin tone. Our approach achieves state-of-the-art results for this challenging configuration, reducing mean per-joint position error by 23% overall and 58% for the lower body. Our architecture also has eight times fewer parameters and runs twice as fast as the current state-of-the-art. Experiments show that training on our synthetic dataset leads to good generalization to real world images without fine-tuning.
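To make the probabilistic output concrete, here is a minimal sketch of how a point estimate can be read off a matrix Fisher distribution. It assumes the standard form of the density, $p(R \mid F) \propto \exp(\operatorname{tr}(F^\top R))$ with a $3 \times 3$ parameter matrix $F$, and uses the usual proper-SVD construction of the mode. The function and variable names (`matrix_fisher_mode`, `rot_z`, `R0`) are illustrative assumptions, not taken from the paper, and the network that would predict $F$ is not modeled here.

```python
import numpy as np

def matrix_fisher_mode(F):
    """Return the mode (most likely rotation) and proper singular values
    of a matrix Fisher distribution with 3x3 parameter matrix F.

    Assumes the density p(R | F) is proportional to exp(trace(F^T R)).
    """
    U, S, Vh = np.linalg.svd(F)
    # Sign correction so that the resulting rotation has determinant +1.
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vh))
    R = U @ np.diag([1.0, 1.0, d]) @ Vh
    # "Proper" singular values: the smallest one carries the sign.
    proper_s = S * np.array([1.0, 1.0, d])
    return R, proper_s

def rot_z(theta):
    """Rotation by theta radians about the z axis (helper for the demo)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Demo: a sharply peaked distribution centred on a known rotation.
R0 = rot_z(0.3)
R_mode, s_vals = matrix_fisher_mode(10.0 * R0)

# Demo: a parameter matrix with negative determinant still yields a rotation.
R_neg, s_neg = matrix_fisher_mode(np.diag([3.0, 2.0, -1.0]))
```

In this sketch, larger singular values mean a more concentrated distribution, so scaling `F` up sharpens the prediction around the same mode without changing it.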

Paper Structure (24 sections, 2 equations, 6 figures, 4 tables)

This paper contains 24 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our proposed method takes images from a head-mounted camera as input, and returns the parameters of a Fisher distribution for matrix rotations for each joint. From the output, we can determine joint rotations and so reconstruct the body of the user, as well as explain the pose prediction based on the uncertainties of the predicted distributions.
  • Figure 2: Example scenes from the SynthEgo dataset showing the left and right egocentric views used for training and an external viewpoint used for visualization only.
  • Figure 3: Comparison of 3D joint location results (blue) overlaid on GT (red) for two synthetic and two real images. We also show predicted (blue) and GT (orange) body meshes. Our method accurately recovers joint locations and rotations.
  • Figure 4: Uncertainty outputs of our method on our real dataset. The second column displays the uncertainty of each joint obtained by summing the concentration parameters along the three axes of each joint. Joints with high uncertainty are typically not visible in the input image. The third column shows the per-vertex uncertainty obtained by sampling the joint rotation distributions.
  • Figure 5: Concentration parameter and joint degree-of-freedom relationships. Averaged concentration parameters from our real dataset show higher concentration values are correlated with axes with lower freedom.
  • ...and 1 more figure
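The per-joint score described in the Figure 4 caption can be sketched as follows. This is one plausible reading, not the authors' code: it assumes each joint has a $3 \times 3$ matrix Fisher parameter $F$, takes the concentration about principal axis $i$ to be the sum of the other two proper singular values ($s_j + s_k$), and sums the three axis values per joint. The names `axis_concentrations`, `F_visible`, and `F_hidden` are hypothetical. Under this reading, a low total indicates a diffuse distribution, that is, high uncertainty.

```python
import numpy as np

def axis_concentrations(F):
    """Per-axis concentration for a batch of matrix Fisher parameters.

    F has shape (J, 3, 3), one parameter matrix per joint.
    Returns an array of shape (J, 3) where entry i is s_j + s_k,
    the sum of the two proper singular values other than s_i.
    """
    U, S, Vh = np.linalg.svd(F)  # batched SVD over the leading axis
    sign = np.sign(np.linalg.det(U) * np.linalg.det(Vh))
    s = S.copy()
    s[:, 2] *= sign  # proper singular values: last one carries the sign
    return s.sum(axis=1, keepdims=True) - s

# Two hypothetical joints: one sharply predicted, one nearly uninformative.
F_visible = 50.0 * np.eye(3)
F_hidden = 0.5 * np.eye(3)
per_axis = axis_concentrations(np.stack([F_visible, F_hidden]))

# Summing over the three axes gives one confidence score per joint.
total = per_axis.sum(axis=1)
```

Comparing `total` across joints gives a simple ranking that could be mapped to colors on a skeleton, with the lowest totals marking the joints the model is least sure about.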