EgoCast: Forecasting Egocentric Human Pose in the Wild

Maria Escobar, Juanita Puentes, Cristhian Forigua, Jordi Pont-Tuset, Kevis-Kokitsi Maninis, Pablo Arbelaez

TL;DR

EgoCast tackles egocentric 3D human pose forecasting in unconstrained real-world settings. It introduces a current-frame estimation module that produces pseudo-ground-truth poses, removing the need for past ground-truth poses at inference. A bimodal Transformer fuses proprioceptive head pose with egocentric visual input to perform both single-frame pose estimation and multi-frame forecasting. The authors propose a realistic benchmarking setup across horizons from 0.5 to 5 seconds on the Ego-Exo4D and Aria Digital Twin datasets, report MPJPE and AUC, and achieve state-of-the-art results on the Ego-Exo4D Body Pose Challenge. The work demonstrates the value of combining proprioception with visual cues for robust, long-horizon egocentric motion understanding, with practical implications for AR.

Abstract

Accurately estimating and forecasting human body pose is important for enhancing the user's sense of immersion in Augmented Reality. Addressing this need, our paper introduces EgoCast, a bimodal method for 3D human pose forecasting using egocentric videos and proprioceptive data. We study the task of human pose forecasting in a realistic setting, extending the boundaries of temporal forecasting in dynamic scenes and building on the existing framework for current-frame pose estimation in the wild. We introduce a current-frame estimation module that generates pseudo-ground-truth poses for inference, eliminating the need for the past ground-truth poses that existing methods typically require during forecasting. Our experimental results on the recent Ego-Exo4D and Aria Digital Twin datasets validate EgoCast for real-life motion estimation. On the Ego-Exo4D Body Pose 2024 Challenge, our method significantly outperforms the state-of-the-art approaches, laying the groundwork for future research in human pose estimation and forecasting in unscripted activities with egocentric inputs.
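
The inference-time flow described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the buffer-based windowing, and the toy estimator and forecaster are all assumptions made for the example. In EgoCast both stages are learned models.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Pose = List[List[float]]   # hypothetical layout: J joints x 3 coordinates
HeadPose = List[float]     # hypothetical layout: flattened headset pose


def forecast_without_gt_history(
    head_poses: Sequence[HeadPose],
    frames: Sequence[object],
    estimate_current: Callable[[HeadPose, object], Pose],
    forecast: Callable[[Sequence[HeadPose], Sequence[Pose]], List[Pose]],
    window: int,
) -> List[Pose]:
    """Forecast future poses using estimated (not ground-truth) past poses."""
    pose_buf: Deque[Pose] = deque(maxlen=window)
    head_buf: Deque[HeadPose] = deque(maxlen=window)
    for head, frame in zip(head_poses, frames):
        # Stage 1: the per-frame estimate stands in for the missing ground truth.
        pose_buf.append(estimate_current(head, frame))
        head_buf.append(head)
    # Stage 2: forecast from the last `window` head poses and estimated poses.
    return forecast(list(head_buf), list(pose_buf))


def toy_estimator(head: HeadPose, frame: object) -> Pose:
    # Toy stand-in: a single "joint" placed at the head position.
    return [list(head[:3])]


def toy_forecaster(
    heads: Sequence[HeadPose], poses: Sequence[Pose], horizon: int = 2
) -> List[Pose]:
    # Toy stand-in: repeat the last estimated pose for each future step.
    return [poses[-1] for _ in range(horizon)]


heads = [[0.0, 0.0, 1.7], [0.1, 0.0, 1.7], [0.2, 0.0, 1.7]]
frames = [None, None, None]  # placeholders for egocentric images
future = forecast_without_gt_history(
    heads, frames, toy_estimator, toy_forecaster, window=2
)
```

The point of the structure is that the forecaster never sees ground-truth history: everything in its pose buffer comes from the current-frame estimator.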

Paper Structure

This paper contains 13 sections, 4 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: 3D Human Pose Forecasting. Our forecasting approach focuses on studying human movement from egocentric inputs in a realistic setting. Given the headset trajectory (3D position and rotation) over the past window ($t-k, t$), represented as the orange line in our figure, and the visual cues gathered along that trajectory, the goal is to forecast the 3D full-body human pose in a future temporal window (from $t$ to $t_{n}$), as shown on the right side of the figure. Note that ground-truth historical poses are not given as input.
  • Figure 2: EgoCast Overview. Our method leverages proprioception and visual streams to estimate 3D human pose. (Top) For forecasting, we input previous camera poses and 3D full-body pose predictions through a forecasting head to estimate future 3D poses from $t+1$ to $t+n$. (Bottom) Since ground-truth 3D full-body poses are not available in real-world scenarios, we implement a current-frame estimation module that integrates camera poses and visual cues to estimate the 3D pose at time $t$.
  • Figure 3: Effect of visual cues on MPJPE for the current-frame estimation module. For each joint, we present the Mean Per-Joint Position Error (MPJPE) variation, contrasting conditions without visual cues against those with visual cues, through a color scale from blue (low error) to red (high error). Visual egocentric data significantly reduces errors, especially in the lower body.
  • Figure 4: 3D Human Pose Estimation with and without Visual Inputs. A common assumption in human pose estimation is that individuals always stand with their hands by their sides. However, integrating visual information into our Current-Frame Estimation module challenges this notion, accurately predicting when a person sits down or raises their hands.
  • Figure 5: Ego-Exo4D Forecasting at different timeframes. We show performance curves for forecasting at 0.5, 1, 2, 3, 4, and 5 seconds into the future. We compare our final EgoCast approach against a forecasting extension of the current state-of-the-art method for current-frame pose estimation and an Oracle approach that aligns the trajectories with the ground truth. Note that since the graph shows MPJPE, lower curves represent better performance.
  • ...and 4 more figures
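
Several captions above report MPJPE. As a reference point, the sketch below computes the standard definition of the metric: the Euclidean distance between each predicted joint and its ground-truth counterpart, averaged over joints. The function name and the nested-list pose layout are assumptions for illustration; the paper's evaluation code may differ, for example in units or in any alignment applied before the comparison.

```python
import math
from typing import Sequence


def mpjpe(pred: Sequence[Sequence[float]], gt: Sequence[Sequence[float]]) -> float:
    """Mean Per-Joint Position Error for one pose, in the units of the inputs."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must contain the same, non-zero number of joints")
    # Average the per-joint Euclidean distances.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Two joints: the first is predicted exactly, the second is off by 5 units.
pred_pose = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
gt_pose = [[0.0, 0.0, 0.0], [1.0, 3.0, 4.0]]
error = mpjpe(pred_pose, gt_pose)  # (0 + 5) / 2 = 2.5
```

For a forecast spanning several frames, the same function can be applied per frame and the results averaged, which is why lower curves indicate better performance in plots of MPJPE against horizon.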