EgoAvatar: Egocentric View-Driven and Photorealistic Full-body Avatars

Jianchun Chen, Jian Wang, Yinda Zhang, Rohit Pandey, Thabo Beeler, Marc Habermann, Christian Theobalt

TL;DR

This work proposes a person-specific egocentric telepresence approach, which jointly models the photoreal digital avatar while also driving it from a single egocentric video.

Abstract

Immersive VR telepresence ideally means being able to interact and communicate with digital avatars that are indistinguishable from and precisely reflect the behaviour of their real counterparts. The core technical challenge is twofold: creating a digital double that faithfully reflects the real human, and tracking the real human solely from egocentric sensing devices that are lightweight and have low energy consumption, e.g. a single RGB camera. To date, no unified solution to this problem exists, as recent works solely focus on egocentric motion capture, only model the head, or build avatars from multi-view captures. In this work, we, for the first time in the literature, propose a person-specific egocentric telepresence approach, which jointly models the photoreal digital avatar while also driving it from a single egocentric video. We first present a character model that is animatable, i.e. can be solely driven by skeletal motion, while being capable of modeling geometry and appearance. Then, we introduce a personalized egocentric motion capture component, which recovers full-body motion from an egocentric video. Finally, we apply the recovered pose to our character model and perform a test-time mesh refinement such that the geometry faithfully projects onto the egocentric view. To validate our design choices, we propose a new and challenging benchmark, which provides paired egocentric and dense multi-view videos of real humans performing various motions. Our experiments demonstrate a clear step towards egocentric and photoreal telepresence, as our method outperforms baselines as well as competing methods. For more details, code, and data, we refer to our project page.

Paper Structure (47 sections, 17 equations, 8 figures, 4 tables)

This paper contains 47 sections, 17 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Overview of EgoAvatar. Taking as input a single egocentric RGB video, we first detect the skeletal pose in the form of 3D keypoints (Sec. \ref{sec-pose}) and then solve for the skeleton parameters, i.e. joint angles, using our IKSolver (Sec. \ref{sec-ik}). The motion signal drives the mesh-based avatar via our MotionDeformer, which is pre-trained on multi-view videos of the actor performing various motions (Sec. \ref{sec-ddc-geo}). At inference time, our EgoDeformer further enhances the egocentric view alignment of the predicted avatar (Sec. \ref{sec-testtime}). Finally, our GaussianPredictor generates dynamic Gaussian parameters in the UV space of the character's mesh, which model the motion- and view-dependent appearance of the avatar (Sec. \ref{sec-dynamic-texture}). Given the recovered Gaussian parameters representing our character, we can use Gaussian splatting to render free-viewpoint videos of the avatar, which is driven solely by an egocentric RGB video of the real human.
  • Figure 2: Qualitative Results. On the left, we show frames of the egocentric driving video depicting the real human. On the right, we render the virtual avatar closely following the egocentric driving signal. We highlight the high level of detail and photorealism, e.g. the high-frequency texture on the orange pullover. Moreover, our method faithfully models the dynamic geometry and appearance effects, e.g. wrinkles and shadows on the shirt.
  • Figure 3: Qualitative Comparisons. We compare our method to recent animatable \cite{habermann2021real} and sparse image-driven \cite{remelli2022drivable, shetty2023holoported} methods in terms of novel view synthesis on three testing sequences showing different subjects. As none of these methods is able to predict the skeletal pose from egocentric video, we provide our pose estimate for a fair comparison. For image-driven methods, we supply the egocentric video as the driving signal. Due to the different underlying 3D representations, we do not perform post-processing, i.e. head avatar exchange, for baseline methods. However, we apply a semi-transparent mask on the region we exclude from the quantitative comparison. We highlight the clear improvement in visual quality that our method achieves compared to prior works, which primarily stems from our carefully designed character representation (see Sec. \ref{sec:charmodel}, \ref{sec-ddc-geo}, \ref{sec-testtime}, and \ref{sec-dynamic-texture}). We increase the brightness of subject 3 for better visualization.
  • Figure 4: Ablation Study of our IKSolver. Without our regularization term $E_\mathrm{Reg}$, our IKSolver (see Sec. \ref{sec-ik}) might converge to twisted angles along the longitudinal bone axis. While such poses may perfectly describe the 3D joint detections, they typically lead to high mesh distortions (see insets). Our simple yet effective regularization prevents such cases and steers the optimization towards a better solution, leading to significantly reduced mesh distortions.
  • Figure 5: Ablation Study of our EgoDeformer. We render our result in the egocentric view and overlay it with the ground truth segmentation mask. Note that after our proposed refinement step, the avatar overlays significantly better with the ground truth. Thus, our final avatar more faithfully reflects the true driving signal.
  • ...and 3 more figures
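
The twist ambiguity described in the Figure 4 caption can be illustrated with a toy example. The sketch below is not the paper's IKSolver: it assumes a single unit-length bone, a keypoint energy that depends only on the swing angle, and a simple quadratic penalty `lam * twist**2` standing in for $E_\mathrm{Reg}$, whose actual form is not given in this excerpt. The function name `solve_ik` and all parameters are hypothetical.

```python
import math

def solve_ik(target, swing0, twist0, lam, steps=500, lr=0.1):
    """Gradient descent on E = E_data(swing) + lam * twist**2 for one bone.

    The bone has unit length and rotates in the plane by `swing`, so its end
    point is (cos(swing), sin(swing)). `twist` is a rotation about the bone's
    own axis: it does not move the end point, so the keypoint term cannot
    constrain it. Only the (assumed) quadratic regularizer does.
    """
    tx, ty = target
    swing, twist = swing0, twist0
    for _ in range(steps):
        # Residual between the predicted end point and the detected keypoint.
        ex = math.cos(swing) - tx
        ey = math.sin(swing) - ty
        # dE_data/dswing; E_data has no dependence on twist.
        g_swing = 2.0 * (-ex * math.sin(swing) + ey * math.cos(swing))
        # dE_reg/dtwist for the assumed penalty lam * twist**2.
        g_twist = 2.0 * lam * twist
        swing -= lr * g_swing
        twist -= lr * g_twist
    return swing, twist

# Keypoint generated by a bone swung by 0.7 rad; the solver starts badly twisted.
target = (math.cos(0.7), math.sin(0.7))
swing_free, twist_free = solve_ik(target, swing0=0.0, twist0=1.5, lam=0.0)
swing_reg, twist_reg = solve_ik(target, swing0=0.0, twist0=1.5, lam=1.0)
print("no regularizer:  ", swing_free, twist_free)
print("with regularizer:", swing_reg, twist_reg)
```

In this toy setting both runs fit the keypoint equally well, but only the regularized run removes the initial twist. This mirrors the caption's point that a pose can explain the 3D joint detections perfectly while still being twisted about the bone axis.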