Generative Head-Mounted Camera Captures for Photorealistic Avatars

Shaojie Bai, Seunghyeon Seo, Yida Wang, Chenghui Li, Owen Wang, Te-Li Wang, Tianyang Ma, Jason Saragih, Shih-En Wei, Nojun Kwak, Hyung Jun Kim

TL;DR

GenHMC addresses the HMC-avatar correspondence bottleneck by learning a diffusion-based generator trained on large unpaired HMC datasets to synthesize HMC images conditioned on expression cues extracted from dome captures. This approach eliminates the need for paired HMC-dome data, improves disentanglement between expression and appearance, and generalizes to unseen identities. It also enables training of universal facial encoders with a mix of real and GenHMC-generated data, yielding better data efficiency and state-of-the-art accuracy on held-out subjects. The work demonstrates scalable, photorealistic avatar synthesis for VR/AR and introduces extensions for glasses, multi-view consistency, and lighting control that broaden practical deployment.

Abstract

Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining the ground-truth state of faces. It is physically impossible to obtain synchronized images from head-mounted camera (HMC) sensing input, which provides partial observations in infrared (IR), and from an array of outside-in dome cameras, which provide full observations that match avatars' appearance. Prior works relying on analysis-by-synthesis methods could generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. The reliance on extensive paired captures (HMC and dome) for the same subject makes it operationally expensive to collect large-scale datasets, which cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method properly disentangles the input conditioning signal, which specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method generalizes to unseen identities, removing the reliance on paired captures. We demonstrate these breakthroughs by evaluating both the synthetic HMC images and the universal face encoders trained from these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.

Paper Structure

This paper contains 25 sections, 9 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: The training and inference processes of our GenHMC model. (Left) During the training phase, we use the detected keypoint+segmentation annotations as the conditioning signals, and train a conditional diffusion model in a self-supervised manner with the noise prediction loss as well as the perceptual losses. (Right) At inference time, we obtain the keypoint+segmentation condition as the (base) control signal for expressions, and sample the diffusion model to generate high-quality synthetic HMC data that aligns with this expression.
  • Figure 2: Directly training universal facial encoder with GenHMC-based HMC-avatar correspondences. Using the GenHMC approach, we can directly establish HMC-avatar correspondences and train on dome capture frames. This is vastly different from the traditional approach which requires an HMC capture paired with the dome capture of the subject.
  • Figure 3: Effect of training size on GenHMC inference results. When the number of training subjects is small, the diversity of the generated output decreases, which can lead to degenerate results when an arbitrary keyseg map is provided as a condition. $S$ denotes the training dataset size, i.e., the number of training subjects.
  • Figure 4: Scalability of GenHMC subjects on UE training. In general, the more GenHMC training subjects are used, the more accurate the pixel-level reconstruction becomes in downstream UE training. We use all 307 real subjects together for training and evaluate on 34 held-out subjects. To assess the impact of GenHMC data size alone without other priors, we compare encoders trained from scratch without SSL pre-training.
  • Figure 5: Qualitative comparison of GenHMC scalability on UE training. Training UE with only synthetic data (top) improves performance with more diverse GenHMC subjects, while combining real and synthetic data (bottom) yields robust results. For each case, the UE driving result is on the left, and the $L_1$ photometric error map (vs. G.T.) is on the right. Colors closer to blue indicate lower error. Kindly refer to our video material for more results.
  • ...and 10 more figures
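
As a reading aid for the Figure 1 caption, the noise prediction loss of a conditional diffusion model is commonly written in the standard DDPM form below. This is a sketch under assumptions, not the paper's confirmed objective: $x_0$ stands for a real HMC image, $c$ for the keypoint+segmentation condition, $\epsilon_\theta$ for the denoising network, $\bar{\alpha}_t$ for the cumulative noise schedule, and $\lambda$ and $\mathcal{L}_{\text{perc}}$ are hypothetical names for the weight and form of the perceptual losses, which this summary does not specify.

$$
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L} = \mathbb{E}_{x_0,\, c,\, t,\, \epsilon}\left[ \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2 \right] + \lambda\, \mathcal{L}_{\text{perc}}
$$

Under this reading, training needs only HMC images and the conditions detected on those same images, which is consistent with the caption's description of the training as self-supervised and with the use of unpaired HMC captures.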