
Self-Avatar Animation in Virtual Reality: Impact of Motion Signals Artifacts on the Full-Body Pose Reconstruction

Antoine Maiorca, Seyed Abolfazl Ghasemzadeh, Thierry Ravet, François Cresson, Thierry Dutoit, Christophe De Vleeschouwer

TL;DR

The paper addresses the challenge of animating lifelike full-body self-avatars in VR when lower-body tracking is absent by augmenting sparse VR signals with external 3D Cartesian positions derived from RGB(D) cameras. It formalizes the fusion problem as $Y_T = \Phi(X_{0,...,T} \oplus X^F_{0,...,T})$ and introduces artifacts such as latency $d$, framerate mismatch via $fps_{ratio}$, occlusion, and Gaussian noise, including a YOLOv8-based estimation pathway for $X^F$. By adapting AvatarPoser (Transformer) and HybridTrack (CNN-1D) and training on AMASS with SMPL, the study systematically degrades motion data to evaluate reconstruction performance using $MPJPE$, $MPJRE$, and $MPJVE$, comparing ground-truth vs YOLOv8-derived Cartesian inputs. The results show that errors increase with artifact intensity, with HybridTrack often more robust to occlusion and noise, while AvatarPoser can benefit from external $X^F$ under certain conditions; velocity reconstruction remains particularly vulnerable. Overall, the work provides practical insights for multimodal self-avatar pipelines and highlights the need for advanced temporal fusion to mitigate desynchronization and artifact effects in real-time VR applications.
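As a rough illustration of the position and velocity metrics named in the summary, the sketch below computes MPJPE and MPJVE in plain Python. It assumes the common definitions (mean Euclidean distance between predicted and reference joint positions, and the same distance applied to finite-difference velocities); the paper's exact formulation, units, and joint set are not given here. The names `mpjpe`, `mpjve`, `velocities`, and the `fps` parameter are hypothetical. MPJRE, the rotation counterpart, is left out because it depends on the rotation representation.

```python
import math
from typing import List, Sequence

Frame = List[Sequence[float]]  # one pose: a list of (x, y, z) joint positions
Motion = List[Frame]           # a sequence of poses


def mpjpe(pred: Motion, gt: Motion) -> float:
    """Mean Euclidean distance between predicted and reference joints."""
    dists = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(dists) / len(dists)


def velocities(motion: Motion, fps: float) -> Motion:
    """Finite-difference joint velocities (position units per second)."""
    return [
        [tuple((b - a) * fps for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(motion, motion[1:])
    ]


def mpjve(pred: Motion, gt: Motion, fps: float) -> float:
    """MPJPE applied to velocities instead of positions (assumed definition)."""
    return mpjpe(velocities(pred, fps), velocities(gt, fps))


# Toy data: two frames, one joint.
gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 1.0)]]
print(mpjpe(pred, gt))          # 0.5
print(mpjve(pred, gt, fps=10))  # 10.0
```

In this toy case the velocity error is much larger than the position error, because differencing multiplies frame-to-frame discrepancies by the frame rate. That is one plausible reason velocity reconstruction would be the most sensitive metric.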

Abstract

Virtual Reality (VR) applications have revolutionized user experiences by immersing individuals in interactive 3D environments. These environments find applications in numerous fields, including healthcare, education, and architecture. A significant aspect of VR is the inclusion of self-avatars, which represent users within the virtual world and enhance interaction and embodiment. However, generating lifelike full-body self-avatar animations remains challenging, particularly in consumer-grade VR systems, where lower-body tracking is often absent. One way to tackle this problem is to provide an external source of motion information that includes lower-body information, such as full Cartesian positions estimated from RGB(D) cameras. Nevertheless, such systems have multiple limitations: desynchronization between the two motion sources and occlusions are examples of significant issues that hinder their implementation. In this paper, we aim to measure how the reconstruction of the articulated self-avatar's full-body pose is affected by (1) the latency between the VR motion features and the estimated positions, (2) the data acquisition rate, (3) occlusions, and (4) the inaccuracy of the position estimation algorithm. In addition, we analyze the motion reconstruction errors using ground truth and 3D Cartesian coordinates estimated from YOLOv8 pose estimation. These analyses show that the studied methods are significantly sensitive to any degradation tested, especially regarding the velocity reconstruction error.

Paper Structure (10 sections, 1 equation, 4 figures, 1 table)


Figures (4)

  • Figure 1: Reconstruction of the full-body position based on the YOLOv8 algorithm. (a) and (b) YOLOv8 detection applied to one frame of the AMASS dataset rendered from two different perspectives. (c) The 3D Cartesian pose reconstructed by triangulation (red) compared with the ground-truth frame (blue).
  • Figure 2: Examples of artifacts on motion signals. Top left: delay between the two motion sources. Top right: Cartesian position framerate reduced by $fps_{ratio} = 2$. Bottom left: Gaussian noise applied to positional data. Bottom right: random occlusion, i.e., a joint position randomly set to zero.
  • Figure 3: Reconstruction errors of the models trained with ground-truth Cartesian positions and with Cartesian positions from YOLOv8.
  • Figure 4: Illustration of two pose samples. Left: ground-truth poses. Mid-left: AvatarPoser trained with ground-truth 3D Cartesian positions and provided with a sequence of this ground truth. Mid-right: the effect of introducing Gaussian noise with $\sigma = 0.01$ into the 3D Cartesian coordinates. Right: AvatarPoser trained with only sparse inputs. The positional features are improved compared with those produced from the sparse inputs alone, even when the Cartesian coordinates are degraded by Gaussian noise with $\sigma = 1$ cm.
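
The four artifacts illustrated in Figure 2 can be sketched on a toy motion sequence as follows. This is a minimal illustration and not the authors' implementation. The function names, the padding of a delayed stream with its first frame, the sample-and-hold behavior for the reduced framerate, and the per-joint occlusion probability `p` are all assumptions; only the idea of setting an occluded joint to zero and the parameters $d$, $fps_{ratio}$, and $\sigma$ come from the text.

```python
import random
from typing import List, Tuple

Joint = Tuple[float, float, float]
Frame = List[Joint]
Motion = List[Frame]


def delay(motion: Motion, d: int) -> Motion:
    """Shift the stream by d frames, repeating the first frame as padding."""
    d = max(0, min(d, len(motion)))
    return [motion[0]] * d + motion[: len(motion) - d]


def reduce_framerate(motion: Motion, fps_ratio: int) -> Motion:
    """Keep every fps_ratio-th frame and hold it until the next update."""
    return [motion[(t // fps_ratio) * fps_ratio] for t in range(len(motion))]


def add_gaussian_noise(motion: Motion, sigma: float, rng: random.Random) -> Motion:
    """Add zero-mean Gaussian noise to every coordinate."""
    return [
        [tuple(c + rng.gauss(0.0, sigma) for c in joint) for joint in frame]
        for frame in motion
    ]


def occlude(motion: Motion, p: float, rng: random.Random) -> Motion:
    """Set each joint position to zero with probability p."""
    return [
        [(0.0, 0.0, 0.0) if rng.random() < p else joint for joint in frame]
        for frame in motion
    ]


# Toy stream: six frames, one joint moving along x.
rng = random.Random(0)
motion = [[(float(t), 0.0, 0.0)] for t in range(6)]
print([f[0][0] for f in delay(motion, 2)])             # [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
print([f[0][0] for f in reduce_framerate(motion, 2)])  # [0.0, 0.0, 2.0, 2.0, 4.0, 4.0]
noisy = add_gaussian_noise(motion, sigma=0.01, rng=rng)
masked = occlude(motion, p=0.3, rng=rng)
```

Each function returns a new sequence of the same length, so the degraded external stream can still be paired frame by frame with the VR features before being passed to a reconstruction model.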