Self-Avatar Animation in Virtual Reality: Impact of Motion Signals Artifacts on the Full-Body Pose Reconstruction
Antoine Maiorca, Seyed Abolfazl Ghasemzadeh, Thierry Ravet, François Cresson, Thierry Dutoit, Christophe De Vleeschouwer
TL;DR
The paper addresses the challenge of animating lifelike full-body self-avatars in VR when lower-body tracking is absent, by augmenting sparse VR signals with external 3D Cartesian positions derived from RGB(D) cameras. It formalizes the fusion problem as $Y_T = \Phi(X_{0,...,T} \oplus X^F_{0,...,T})$, where $X^F$ denotes the external positions, and degrades $X^F$ with four artifacts: latency $d$, framerate mismatch controlled by $fps_{ratio}$, occlusion, and Gaussian noise. It also considers a pathway in which $X^F$ is estimated with YOLOv8. The study adapts AvatarPoser (Transformer) and HybridTrack (CNN-1D), trains them on AMASS with SMPL, and systematically degrades the motion data to evaluate reconstruction performance using $MPJPE$, $MPJRE$, and $MPJVE$, comparing ground-truth and YOLOv8-derived Cartesian inputs. The results show that errors increase with artifact intensity. HybridTrack is often more robust to occlusion and noise, AvatarPoser can benefit from the external $X^F$ under certain conditions, and velocity reconstruction remains particularly vulnerable. Overall, the work provides practical insights for multimodal self-avatar pipelines and highlights the need for advanced temporal fusion to mitigate desynchronization and artifact effects in real-time VR applications.
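For readers unfamiliar with the position and velocity metrics named above, the following is a minimal sketch assuming the commonly used definitions: $MPJPE$ as the mean Euclidean distance between predicted and reference joint positions, and $MPJVE$ as the same distance computed on finite-difference joint velocities. The function names, the `(T, J, 3)` array layout, and the `fps` parameter are assumptions made for illustration; the paper's exact formulations may differ. $MPJRE$ is omitted because it depends on the rotation representation.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (T, J, 3) holding T frames of J joint positions.
    Returns the Euclidean distance averaged over frames and joints.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def mpjve(pred, gt, fps):
    """Mean per-joint velocity error (assumed finite-difference definition).

    Velocities are approximated as frame-to-frame differences scaled by fps.
    """
    vel_pred = np.diff(pred, axis=0) * fps
    vel_gt = np.diff(gt, axis=0) * fps
    return float(np.linalg.norm(vel_pred - vel_gt, axis=-1).mean())

# Usage: a constant positional offset changes MPJPE but not MPJVE.
gt = np.zeros((4, 2, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
position_error = mpjpe(pred, gt)
velocity_error = mpjve(pred, gt, fps=60)
```

Under these definitions, a constant offset leaves the velocity error at zero, whereas frame-to-frame jitter is amplified by the `fps` factor, which is one plausible reason velocity error reacts strongly to noisy inputs.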
Abstract
Virtual Reality (VR) applications have revolutionized user experiences by immersing individuals in interactive 3D environments. These environments find applications in numerous fields, including healthcare, education, and architecture. A significant aspect of VR is the inclusion of self-avatars, which represent users within the virtual world and enhance interaction and embodiment. However, generating lifelike full-body self-avatar animations remains challenging, particularly in consumer-grade VR systems, where lower-body tracking is often absent. One way to tackle this problem is to provide an external source of motion information that includes the lower body, such as full-body Cartesian positions estimated from RGB(D) cameras. Nevertheless, such systems have multiple limitations: the desynchronization between the two motion sources and occlusions are examples of significant issues that hinder their implementation. In this paper, we measure how the reconstruction of the articulated self-avatar's full-body pose is affected by (1) the latency between the VR motion features and the estimated positions, (2) the data acquisition rate, (3) occlusions, and (4) the inaccuracy of the position estimation algorithm. In addition, we analyze the motion reconstruction errors using both ground-truth 3D Cartesian coordinates and coordinates estimated with YOLOv8 pose estimation. These analyses show that the studied methods are significantly sensitive to any degradation tested, especially regarding the velocity reconstruction error.
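The four degradations listed above can be illustrated with a small sketch that corrupts a sequence of external joint positions. This is not the authors' implementation: the function name `degrade`, its parameters, the zero-order hold used for the acquisition rate, and the choice to mark occluded joints with zeros are all assumptions made for illustration.

```python
import numpy as np

def degrade(x_f, delay=0, fps_ratio=1, occlusion_prob=0.0, noise_std=0.0, seed=0):
    """Apply four hypothetical artifacts to external positions x_f of shape (T, J, 3).

    delay          -- latency in frames between capture and availability
    fps_ratio      -- assumed integer ratio of VR rate to camera rate
    occlusion_prob -- probability that a joint is occluded in a given frame
    noise_std      -- standard deviation of additive Gaussian noise
    """
    rng = np.random.default_rng(seed)
    T, J, _ = x_f.shape

    # (4) Estimation inaccuracy: additive Gaussian noise (creates a new array).
    out = x_f + rng.normal(0.0, noise_std, size=x_f.shape)

    # (3) Occlusion: occluded joints are zeroed (assumed encoding).
    occluded = rng.random((T, J)) < occlusion_prob
    out[occluded] = 0.0

    # (1) Latency and (2) acquisition rate: at VR frame t, the newest available
    # sample was captured at t - delay, rounded down to a camera frame.
    t = np.arange(T)
    src = (np.maximum(t - delay, 0) // fps_ratio) * fps_ratio
    return out[src]

# Usage: a toy sequence of 6 frames and 2 joints.
x = np.arange(6 * 2 * 3, dtype=float).reshape(6, 2, 3)
delayed = degrade(x, delay=2)
slow = degrade(x, fps_ratio=2)
hidden = degrade(x, occlusion_prob=1.0)
```

Holding the last available sample is only one way to model a slower camera; interpolating between samples would give a different degradation profile.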
