EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling
Songpengcheng Xia, Yu Zhang, Zhuo Su, Xiaozheng Zheng, Zheng Lv, Guidong Wang, Yongjie Zhang, Qi Wu, Lei Chu, Ling Pei
TL;DR
EnvPoser addresses full-body motion estimation from sparse head-mounted display (HMD) and hand tracking signals by introducing a two-stage, environment-aware framework that explicitly models multi-hypothesis uncertainty and refines results with semantic and geometric scene constraints. Stage I produces uncertainty-aware initial estimates that capture multiple plausible motions, while Stage II uses a cropped environmental point cloud and cross-attention to enforce realistic interactions and prevent collisions. The approach achieves state-of-the-art results on EgoBody and GIMO, reducing both angular and positional errors and improving motion smoothness, particularly in environments with frequent interactions. By incorporating environmental context and uncertainty quantification, EnvPoser offers robust, realistic motion estimation for VR/AR applications and lays the groundwork for dynamic-scene extensions and richer sensory integration.
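To make the Stage II idea concrete, the following is a minimal sketch of cropping a scene point cloud around the body and letting motion features attend to it with scaled dot-product cross-attention. It is an illustration under stated assumptions, not the authors' implementation: the function names, the spherical cropping rule, the radius, and the use of raw 3D coordinates as features (instead of learned encoders and projections) are all hypothetical.

```python
import numpy as np

def crop_point_cloud(points, center, radius):
    """Keep scene points within `radius` of `center`.

    Assumption: a simple spherical crop; the paper's cropping rule may differ.
    points: (N, 3) array, center: (3,) array.
    """
    dist = np.linalg.norm(points - center, axis=1)
    return points[dist <= radius]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: (J, d) motion-side features, keys/values: (N, d) scene-side features.
    Returns the attended features (J, d) and the attention weights (J, N).
    """
    if keys.shape[0] == 0:
        # No scene points survived the crop: return zero context.
        return (np.zeros((queries.shape[0], values.shape[1])),
                np.zeros((queries.shape[0], 0)))
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over scene points
    return weights @ values, weights

# Toy usage: three scene points, two joints near the origin.
scene = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
local_scene = crop_point_cloud(scene, center=np.zeros(3), radius=1.0)
joints = np.array([[0.1, 0.0, 0.0], [0.4, 0.0, 0.0]])
context, attn = cross_attention(joints, local_scene, local_scene)
```

In a real model the queries, keys, and values would come from learned motion and point-cloud encoders; raw coordinates are used here only to keep the sketch self-contained.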
Abstract
Estimating full-body motion from the head and hand tracking signals of VR devices holds great potential for various applications. However, the sparsity and unique distribution of observations present a significant challenge, resulting in an ill-posed problem with multiple feasible solutions (i.e., hypotheses). This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. Therefore, we propose a new method, EnvPoser, that employs a two-stage framework to perform full-body motion estimation using sparse tracking signals and a pre-scanned environment from VR devices. EnvPoser models the multi-hypothesis nature of human motion through an uncertainty-aware estimation module in the first stage. In the second stage, we refine these multi-hypothesis estimates by integrating semantic and geometric environmental constraints, ensuring that the final motion estimate aligns realistically with both the environmental context and physical interactions. Qualitative and quantitative experiments on two public datasets demonstrate that our method achieves state-of-the-art performance, highlighting significant improvements in human motion estimation within motion-environment interaction scenarios.
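As a rough illustration of what "multiple hypotheses with uncertainty" can mean in practice, the sketch below draws several candidate poses from a per-joint Gaussian with a predicted mean and standard deviation. This is one common way to represent such uncertainty and is an assumption here, not a description of EnvPoser's actual module; the function name, the Gaussian form, and the toy numbers are hypothetical.

```python
import numpy as np

def sample_hypotheses(mean, sigma, num_hypotheses, seed=0):
    """Draw pose hypotheses from an independent per-joint Gaussian.

    mean, sigma: (J, 3) arrays of predicted joint positions and standard deviations.
    Returns an array of shape (num_hypotheses, J, 3).
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((num_hypotheses,) + mean.shape)
    return mean + sigma * noise

# Toy usage: head and hand are observed (low uncertainty),
# the foot is unobserved (high uncertainty).
mean = np.zeros((3, 3))
sigma = np.array([[0.01] * 3,   # head
                  [0.01] * 3,   # hand
                  [0.20] * 3])  # foot
hypotheses = sample_hypotheses(mean, sigma, num_hypotheses=500)
spread = hypotheses.std(axis=0).mean(axis=1)  # per-joint spread across hypotheses
```

The per-joint spread is much larger for the foot than for the tracked joints, mirroring the abstract's point that ambiguity concentrates in the lower body.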
