Generative Head-Mounted Camera Captures for Photorealistic Avatars

Shaojie Bai; Seunghyeon Seo; Yida Wang; Chenghui Li; Owen Wang; Te-Li Wang; Tianyang Ma; Jason Saragih; Shih-En Wei; Nojun Kwak; Hyung Jun Kim

TL;DR

GenHMC addresses the HMC-avatar correspondence bottleneck by learning a diffusion-based generator trained on large unpaired HMC datasets to synthesize HMC images conditioned on expression cues extracted from dome captures. This approach eliminates the need for paired HMC-dome data, improves disentanglement between expression and appearance, and generalizes to unseen identities. It also enables training of universal facial encoders with a mix of real and GenHMC-generated data, yielding better data efficiency and state-of-the-art accuracy on held-out subjects. The work demonstrates scalable, photorealistic avatar synthesis for VR/AR and introduces extensions for glasses, multi-view consistency, and lighting control that broaden practical deployment.

Abstract

Enabling photorealistic avatar animations in virtual and augmented reality (VR/AR) has been challenging because of the difficulty of obtaining the ground-truth state of faces. It is physically impossible to obtain synchronized images from both the head-mounted camera (HMC) sensing input, which provides only partial observations in infrared (IR), and an array of outside-in dome cameras, which provide full observations that match the avatars' appearance. Prior works relying on analysis-by-synthesis methods can generate accurate ground truth, but suffer from imperfect disentanglement between expression and style in their personalized training. Their reliance on extensive paired captures (HMC and dome) of the same subject makes it operationally expensive to collect large-scale datasets, and those datasets cannot be reused for different HMC viewpoints and lighting. In this work, we propose a novel generative approach, Generative HMC (GenHMC), that leverages large unpaired HMC captures, which are much easier to collect, to directly generate high-quality synthetic HMC images given any conditioning avatar state from dome captures. We show that our method properly disentangles the input conditioning signal, which specifies facial expression and viewpoint, from facial appearance, leading to more accurate ground truth. Furthermore, our method generalizes to unseen identities, removing the reliance on paired captures. We demonstrate these breakthroughs by evaluating both the synthetic HMC images and universal face encoders trained on these new HMC-avatar correspondences, which achieve better data efficiency and state-of-the-art accuracy.
