EgoCast: Forecasting Egocentric Human Pose in the Wild
Maria Escobar, Juanita Puentes, Cristhian Forigua, Jordi Pont-Tuset, Kevis-Kokitsi Maninis, Pablo Arbelaez
TL;DR
EgoCast tackles egocentric 3D human pose forecasting in unconstrained real-world settings by introducing a current-frame estimation module that produces pseudo-groundtruth poses, removing the need for past ground-truth poses at inference. It fuses proprioceptive head pose with egocentric visual input through a bimodal Transformer to perform both single-frame pose estimation and multi-frame forecasting. The authors propose a realistic benchmarking setup across horizons from 0.5 to 5 seconds using the Ego-Exo4D and Aria Digital Twin datasets, reporting MPJPE and AUC metrics and achieving state-of-the-art results on the Ego-Exo4D Body Pose Challenge. The work demonstrates the value of combining proprioception with visual cues for robust, long-horizon egocentric motion understanding, with practical implications for AR.
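As a rough illustration of the MPJPE metric mentioned above, the sketch below computes the mean Euclidean distance between predicted and reference joints, and then reads that error off at chosen forecast horizons. The function names, the `fps` parameter, and the convention that frame index 0 is the first future frame are assumptions made for this example; they are not taken from the EgoCast evaluation code, and AUC is not covered.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error between two poses.

    pred, target: sequences of (x, y, z) joint positions in the same
    joint order. The result is in the same units as the inputs.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("poses must be non-empty and have the same joint count")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def mpjpe_per_horizon(pred_seq, target_seq, fps, horizons_s):
    """MPJPE at selected forecast horizons, given in seconds.

    pred_seq, target_seq: lists of future poses, where index 0 is the
    first frame after the current one (an assumption of this sketch).
    """
    errors = {}
    for h in horizons_s:
        idx = round(h * fps) - 1  # index of the pose h seconds ahead
        errors[h] = mpjpe(pred_seq[idx], target_seq[idx])
    return errors
```

For instance, with poses sampled at a hypothetical 10 frames per second, `mpjpe_per_horizon(pred_seq, target_seq, fps=10, horizons_s=[0.5, 5])` would compare the 5th and 50th future frames.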
Abstract
Accurately estimating and forecasting human body pose is important for enhancing the user's sense of immersion in Augmented Reality. Addressing this need, our paper introduces EgoCast, a bimodal method for 3D human pose forecasting using egocentric videos and proprioceptive data. We study the task of human pose forecasting in a realistic setting, extending the boundaries of temporal forecasting in dynamic scenes and building on the existing framework for current-frame pose estimation in the wild. We introduce a current-frame estimation module that generates pseudo-groundtruth poses for inference, eliminating the need for the past groundtruth poses that existing methods typically require during forecasting. Our experimental results on the recent Ego-Exo4D and Aria Digital Twin datasets validate EgoCast for real-life motion estimation. On the Ego-Exo4D Body Pose 2024 Challenge, our method significantly outperforms state-of-the-art approaches, laying the groundwork for future research in human pose estimation and forecasting in unscripted activities with egocentric inputs.
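The role of the current-frame estimation module at inference can be pictured with a minimal sketch: each incoming frame is turned into an estimated pose, the most recent estimates are kept as a pseudo-groundtruth history, and the forecaster consumes that history in place of real past poses. Everything here is hypothetical scaffolding for illustration: `estimate_pose`, `forecast`, and `history_len` are placeholder names, and the actual EgoCast modules and their interfaces may differ.

```python
from collections import deque


def forecast_with_pseudo_history(frames, head_poses, estimate_pose, forecast, history_len):
    """Run forecasting using estimated poses as the pose history.

    estimate_pose(frame, head_pose) -> pose      (hypothetical current-frame module)
    forecast(list_of_past_poses)    -> forecast  (hypothetical forecasting module)
    """
    history = deque(maxlen=history_len)  # rolling window of pseudo-groundtruth poses
    forecasts = []
    for frame, head_pose in zip(frames, head_poses):
        # No ground-truth pose is read here: the estimate stands in for it.
        history.append(estimate_pose(frame, head_pose))
        if len(history) == history_len:
            forecasts.append(forecast(list(history)))
    return forecasts
```

No forecast is emitted until the window is full, so the first `history_len - 1` frames only warm up the history; that warm-up policy is a choice made for this sketch.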
