Fast Registration of Photorealistic Avatars for VR Facial Animation

Chaitanya Patel, Shaojie Bai, Te-Li Wang, Jason Saragih, Shih-En Wei

TL;DR

This work tackles fast, high-fidelity registration of photorealistic VR avatars using headset-mounted infrared images, addressing a core domain gap between IR camera data and avatar renderings. It decouples the problem into a transformer-based iterative refinement module and an avatar-conditioned image-to-image style transfer module, enabling online, identity-generalizable registration without costly offline optimization. The approach outperforms direct regression in the online setting and approaches offline results while remaining applicable in real time. It is validated on a large, multi-identity dataset, which is released publicly together with the code. The key contribution is a generic, two-module framework in which domain adaptation and pose-expression estimation reinforce each other, with detailed ablations and architectural disclosures to spur further research. This has practical impact for immersive VR telepresence and adaptive real-time avatar animation.

Abstract

Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of labels for headset-mounted camera (HMC) images needs to be efficient and accurate while the user is wearing a VR headset. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is reintroduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain transfer module conditioned on current estimates. These two modules reinforce each other: domain transfer becomes easier when close-to-groundtruth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization, and produces online registration of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over these baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.
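
The alternation between the two modules can be illustrated with a minimal sketch. It borrows the symbols used in the figure captions (HMC input H, initializer F0, style transfer S, refinement F, avatar decoder D, estimates z_t and v_t), but everything else is an assumption made for illustration: the additive update, the number of steps, and all toy stand-ins (a linear "renderer", an affine domain gap, a damped least-squares refiner) are hypothetical and are not the networks described in the paper.

```python
import numpy as np

def register(H, F0, S, F, D, num_steps=3):
    """Alternate style transfer S and refinement F, starting from F0's estimate."""
    z, v = F0(H)                  # initial expression code and head pose
    for _ in range(num_steps):
        R = D(z, v)               # render the avatar at the current estimate
        H_hat = S(H, R)           # move the HMC image into the avatar domain
        dz, dv = F(H_hat, R)      # predict an update from in-domain inputs
        z, v = z + dz, v + dv     # additive update (an assumption of this sketch)
    return z, v

# ---- Hypothetical toy stand-ins, only so that the loop can be run ----
rng = np.random.default_rng(0)
NZ, NV, NPIX = 4, 6, 32                 # toy sizes: expression, pose, pixels
M = rng.normal(size=(NPIX, NZ + NV))    # weights of a linear toy "renderer"
GAIN, BIAS = 0.5, 0.2                   # toy domain gap: an affine intensity change

def D(z, v):
    # Toy decoder: a linear map from (z, v) to a flat "image".
    return M @ np.concatenate([z, v])

def S(H, R):
    # Toy style transfer: inverts the known affine gap. A real module would
    # be learned and conditioned on the render R; this stand-in ignores R.
    return (H - BIAS) / GAIN

def F(H_hat, R):
    # Toy refiner: a damped least-squares step on the in-domain residual.
    step = 0.5 * np.linalg.pinv(M) @ (H_hat - R)
    return step[:NZ], step[NZ:]

def F0(H):
    # Toy initializer: a crude starting point at zero.
    return np.zeros(NZ), np.zeros(NV)

z_true, v_true = rng.normal(size=NZ), rng.normal(size=NV)
H = GAIN * D(z_true, v_true) + BIAS     # simulated out-of-domain HMC image

def error(z, v):
    return float(np.linalg.norm(np.concatenate([z - z_true, v - v_true])))

z0, v0 = F0(H)
z3, v3 = register(H, F0, S, F, D, num_steps=3)
print(f"initial error {error(z0, v0):.4f}, after 3 steps {error(z3, v3):.4f}")
```

In this toy setting each step halves the remaining error because the refiner only ever compares two images in the same domain. Feeding H to F directly, without S, would leave the affine gap inside the residual, which is the failure mode the abstract attributes to the domain gap.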

Paper Structure (23 sections, 5 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: On consumer VR headsets, oblique mouth views and a large image domain gap hinder high-quality face registration. As shown, subtle lip shapes and jaw movements are often barely observable. Under this setting, our method efficiently and accurately registers the facial expression and head pose of the photorealistic avatars chen2022instant of unseen identities.
  • Figure 2: Examples of HMC images and corresponding ground truth expressions rendered on their avatars by the offline registration method schwartz2020eyes, which utilizes augmented cameras with better frontal views (highlighted in green). In this work, we aim to efficiently register faces using cameras on consumer headsets, which only have oblique views (highlighted in red). In such views, information about subtle expressions (e.g., lip movements) is often covered by very few pixels or is not visible at all.
  • Figure 3: Overview of the method. We decouple the problem into an avatar-conditioned image-to-image style transfer module $\mathcal{S}$ and an iterative refinement module $\mathcal{F}$. Module $\mathcal{F}_0$ initializes both modules by directly estimating on HMC input $\boldsymbol{H}$.
  • Figure 4: Architecture of iterative refinement module $\mathcal{F}$
  • Figure 5: Progression of iterative refinement in $\mathcal{F}$: we show intermediate results $\mathcal{D}(\boldsymbol{z}_{t}, \boldsymbol{v}_{t})$ and corresponding error maps for each step $t$.
  • ...and 9 more figures