URAvatar: Universal Relightable Gaussian Codec Avatars

Junxuan Li, Chen Cao, Gabriel Schwartz, Rawal Khirodkar, Christian Richardt, Tomas Simon, Yaser Sheikh, Shunsuke Saito

TL;DR

A universal relightable avatar model represented by 3D Gaussians is built. It incorporates global light transport efficiently and outperforms existing approaches while retaining real-time rendering capability.

Abstract

We present a new approach to creating photorealistic and relightable head avatars from a phone scan with unknown illumination. The reconstructed avatars can be animated and relit in real time with the global illumination of diverse environments. Unlike existing approaches that estimate parametric reflectance parameters via inverse rendering, our approach directly models learnable radiance transfer that incorporates global light transport in an efficient manner for real-time rendering. However, learning such a complex light transport that can generalize across identities is non-trivial. A phone scan in a single environment lacks sufficient information to infer how the head would appear in general environments. To address this, we build a universal relightable avatar model represented by 3D Gaussians. We train on hundreds of high-quality multi-view human scans with controllable point lights. High-resolution geometric guidance further enhances the reconstruction accuracy and generalization. Once trained, we finetune the pretrained model on a phone scan using inverse rendering to obtain a personalized relightable avatar. Our experiments establish the efficacy of our design, outperforming existing approaches while retaining real-time rendering capability.

Paper Structure

This paper contains 26 sections, 13 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Method Overview. (a) We employ a large relightable corpus of multi-view facial performances to train a cross-identity decoder $\mathcal{D}$ that can generate volumetric avatar representations. (b) Given a single phone scan of an unseen identity, we reconstruct the head pose, geometry, and albedo texture, and fine-tune our pretrained relightable prior model. (c) Our final model provides disentangled control over relighting, gaze, and neck motion.
  • Figure 2: Network architecture. Our expression encoder, ${\mathcal{E}}_{\text{exp}}$, takes a 1024x1024 positional map of face geometry as input and encodes it into an expression latent code with the map size of 4x4. Our downsampling block consists of a convolutional layer with a kernel size of 4 and stride of 2, followed by a leaky ReLU activation function. Similarly, our upsampling block is composed of a transposed convolutional layer with a kernel size of 4 and stride of 2, followed by a leaky ReLU activation function. Our identity encoder, ${\mathcal{E}}_{\text{id}}$, is a U-Net-like architecture that takes the mean texture and geometry of a subject as input, producing a multi-scale feature pyramid as the ID conditioning data. The produced feature maps are then added to the corresponding layer of the decoder to produce the guide mesh and its Gaussian parameters. Our decoder consists of 7 upsampling blocks that take a map with a size of 4x4 as input and output 1024x1024 Gaussian parameter maps.
  • Figure 3: Visualization of the effect of our fitted environment lights, and the comparison to the ground-truth environment lights.
  • Figure 4: Ablation study on unified eye specular visibility decoder and high-resolution tracked mesh.
  • Figure 5: Qualitative comparison of environment relighting between FLARE (Bharadwaj et al., 2023) and our approach in an unseen test environment.
  • ...and 2 more figures
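
The Figure 2 caption describes blocks built from a convolution or transposed convolution with kernel size 4 and stride 2. As a rough sanity check of the map sizes it quotes, the sketch below traces spatial resolution through such blocks using the standard output-size formulas. The padding of 1 is an assumption (the caption does not state it), and the helper names `conv_out`, `deconv_out`, and `trace` are hypothetical, not taken from the paper.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a strided convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transposed convolution (padding=1 is an assumption)."""
    return (size - 1) * stride - 2 * padding + kernel


def trace(size, step, n_blocks):
    """Return the list of spatial sizes after each of n_blocks applications."""
    sizes = [size]
    for _ in range(n_blocks):
        size = step(size)
        sizes.append(size)
    return sizes


# Encoder side: halve a 1024x1024 positional map eight times.
down = trace(1024, conv_out, 8)
# Decoder side: double a 4x4 map seven times, as the caption states.
up = trace(4, deconv_out, 7)

print(down)
print(up)
```

Under the assumed padding, each block halves or doubles the resolution exactly, so eight downsampling blocks take 1024 to 4, while seven upsampling blocks take 4 to 512. Reaching 1024 from 4 would need one more doubling or a different configuration in some layer; the caption does not give enough detail to say which applies.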