EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality

Haojie Cheng; Shaun Jing Heng Ong; Shaoyu Cai; Aiden Tat Yang Koh; Fuxi Ouyang; Eng Tat Khoo

TL;DR

EgoPoseVR tackles egocentric full-body pose estimation in VR by fusing sparse HMD motion data with egocentric RGB-D observations from a headset-mounted camera. The approach uses dual-stream spatiotemporal transformers to extract motion and visual cues, a cross-modal attention module to fuse them, and a kinematic optimizer to enforce alignment with VR signals and skeletal priors. A large synthetic VR dataset with temporally synchronized HMD and RGB-D data supports training and evaluation. Experiments show improved accuracy and real-time performance over state-of-the-art baselines, and a user study in real environments complements these results. The framework enables accurate VR embodiment without additional body-worn sensors or room-scale tracking, with potential applications in interactive VR, rehabilitation, and immersive experiences.
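To make the idea of a kinematic optimizer more concrete, the following is a minimal sketch of an energy with two terms, named after the "Head and Hands Alignment Term" and "Skeletal Structure Preservation Term" sections in the paper outline. The quadratic form, the weights `w_align` and `w_bone`, the joint names, and all function names are assumptions made for illustration; the paper's actual energy functions are not reproduced here.

```python
import math

def alignment_energy(pred_joints, tracked_joints):
    """Squared distance between predicted joints and tracked VR signals.

    Only joints present in tracked_joints (e.g. head, hands) contribute.
    """
    return sum(
        math.dist(pred_joints[name], target) ** 2
        for name, target in tracked_joints.items()
    )

def bone_length_energy(pred_joints, bones, rest_lengths):
    """Penalty for bones whose length deviates from a rest-pose length."""
    total = 0.0
    for (parent, child), rest in zip(bones, rest_lengths):
        length = math.dist(pred_joints[parent], pred_joints[child])
        total += (length - rest) ** 2
    return total

def total_energy(pred_joints, tracked_joints, bones, rest_lengths,
                 w_align=1.0, w_bone=1.0):
    """Weighted sum of the two terms (weights are hypothetical)."""
    return (w_align * alignment_energy(pred_joints, tracked_joints)
            + w_bone * bone_length_energy(pred_joints, bones, rest_lengths))

# Toy values, not taken from the paper.
pred = {"head": (0.0, 1.7, 0.0), "neck": (0.0, 1.5, 0.0)}
tracked = {"head": (0.0, 1.6, 0.0)}   # HMD reports the head 10 cm lower
bones = [("head", "neck")]
rest = [0.2]                          # assumed rest-pose bone length in metres
energy = total_energy(pred, tracked, bones, rest)
```

In this toy case the bone already has its rest length, so the energy comes almost entirely from the 10 cm head offset. An optimizer would adjust the predicted joints to reduce this value.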

Abstract

Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and a lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieves significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use than baseline methods. These results indicate that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
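As a rough illustration of the cross-attention fusion mentioned in the abstract, the sketch below implements generic single-head scaled dot-product cross-attention in plain Python. It assumes HMD tokens act as queries and RGB-D tokens as keys and values; that assignment, the toy feature values, and the names `cross_attention`, `hmd_tokens`, and `rgbd_tokens` are illustrative assumptions, and the sketch omits the learned projections, multiple heads, and spatiotemporal structure a real model would use.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: tokens from one modality (assumed here: HMD motion features)
    keys, values: tokens from the other modality (assumed here: RGB-D features)
    Returns one fused vector per query token.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Toy features, not taken from the paper.
hmd_tokens = [[1.0, 0.0], [0.0, 1.0]]
rgbd_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_tokens = cross_attention(hmd_tokens, rgbd_tokens, rgbd_tokens)
```

Each fused token is a convex combination of the RGB-D tokens, weighted by how strongly the corresponding HMD token matches each of them.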

Paper Structure (27 sections, 7 equations, 11 figures, 4 tables, 1 algorithm)

Introduction
Related Works
Egocentric Pose Estimation in VR
Motion Tracking from Egocentric Camera
Spatiotemporal Modeling for Temporal Dynamics
Methodology
Overview
Input Modalities and Pose Representation
HMD and RGB-D Feature Modeling
HMD Stream Encoder
RGB-D Stream Encoder
Cross-Modal Spatiotemporal Integration
Kinematic Pose Optimization via Energy Functions
Head and Hands Alignment Term
Skeletal Structure Preservation Term
...and 12 more sections

Figures (11)

Figure 1: Workflow of our EgoPoseVR framework for egocentric full-body pose estimation in VR. HMD motion and egocentric downward-facing RGB-D inputs are jointly encoded by the spatiotemporal module to predict full-body poses. A tailored kinematic optimization via energy functions ensures structurally consistent and immersive avatar rendering.
Figure 2: Pose refinement results under different kinematic pose optimization settings. The semi-transparent mesh is the ground-truth pose, while the predicted mesh is color-coded by the pose error (redder indicates higher error).
Figure 3: Example images from our dataset showing third-person and egocentric RGB-D views. Green dots indicate 2D joint projections and pink dashed boxes mark the motion blur regions.
Figure 4: System architecture illustrating cross-device data transmission for real-time SMPL pose estimation and visualization.
Figure 5: Qualitative results of module-level comparison with state-of-the-art methods. The leftmost blue avatar represents the ground truth, while the remaining five avatars from left to right correspond respectively to the methods listed in Rows A–E of the table giving the quantitative comparison of different methods. Pose errors are color-coded relative to the ground truth, with deeper red indicating greater error.
...and 6 more figures
