One2Avatar: Generative Implicit Head Avatar For Few-shot User Adaptation

Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, Yinda Zhang

TL;DR

This paper learns a generative model of animatable, photo-realistic 3D head avatars from a multi-view dataset capturing the expressions of 2407 subjects, and uses it as a prior for creating personalized avatars from few-shot images.

Abstract

Traditional methods for constructing high-quality, personalized head avatars from monocular videos demand extensive face captures and long training times, which makes them difficult to scale. This paper introduces a novel approach that creates a high-quality head avatar from only one or a few images per user. We learn a generative model of animatable, photo-realistic 3D head avatars from a multi-view dataset capturing the expressions of 2407 subjects, and leverage it as a prior for creating personalized avatars from few-shot images. Unlike previous 3D-aware face generative models, our prior is built on a 3DMM-anchored neural radiance field backbone, which we show to be more effective for avatar creation via auto-decoding from few-shot inputs. We also address unstable 3DMM fitting by jointly optimizing the 3DMM fitting and camera calibration, which leads to better few-shot adaptation. Our method produces compelling results and outperforms existing state-of-the-art methods for few-shot avatar adaptation, paving the way for more efficient and personalized avatar creation.
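The auto-decoding adaptation described above can be illustrated with a deliberately tiny sketch. It assumes a hypothetical linear decoder in place of the paper's StyleGAN2, U-Net, and NeRF pipeline, so each frame's rendered image is replaced by a feature vector: the prediction for frame f is `W @ z + U @ (e_f + delta_f)`, where `z` is the subject latent code, `e_f` a possibly noisy 3DMM expression code, and `delta_f` a per-frame fitting correction. As in the Figure 2 caption, `z` starts at the mean of the training latent codes and is optimized jointly with the decoder weights `W` and the per-frame corrections. Freezing `U`, the L2 penalty `lam` on the corrections, the learning rate, and the function name `adapt_few_shot` are all illustrative assumptions.

```python
import numpy as np


def adapt_few_shot(W, U, train_latents, expr_codes, targets,
                   lr=1e-2, steps=500, lam=1e-2):
    """Toy auto-decoding adaptation with a hypothetical linear decoder.

    W: (d_y, d_z) identity decoder weights, fine-tuned jointly with z.
    U: (d_y, d_e) expression decoder weights, kept frozen in this sketch.
    train_latents: (N, d_z) latent codes learned for the training subjects.
    expr_codes: (F, d_e) per-frame 3DMM expression codes (possibly noisy).
    targets: (F, d_y) observations of the new subject, one row per frame.
    """
    W = W.copy()                          # do not modify the caller's prior weights
    z = train_latents.mean(axis=0)        # initialize with the mean latent code
    delta = np.zeros_like(expr_codes)     # per-frame 3DMM fitting corrections
    n_frames = targets.shape[0]

    def loss_and_residual(W, z, delta):
        pred = z @ W.T + (expr_codes + delta) @ U.T          # (F, d_y)
        resid = pred - targets
        loss = np.sum(resid ** 2) / n_frames + lam * np.sum(delta ** 2)
        return loss, resid

    history = []
    for _ in range(steps):
        loss, resid = loss_and_residual(W, z, delta)
        history.append(loss)
        # Analytic gradients, all evaluated at the current parameters.
        r_sum = resid.sum(axis=0)                             # (d_y,)
        grad_z = 2.0 / n_frames * (W.T @ r_sum)
        grad_W = 2.0 / n_frames * np.outer(r_sum, z)
        grad_delta = 2.0 / n_frames * (resid @ U) + 2.0 * lam * delta
        # Joint gradient step on latent code, decoder weights, and corrections.
        z -= lr * grad_z
        W -= lr * grad_W
        delta -= lr * grad_delta
    history.append(loss_and_residual(W, z, delta)[0])
    return z, W, delta, history


# Toy demo with random data: 2 frames of a new subject, 50 training latents.
rng = np.random.default_rng(0)
d_z, d_e, d_y, n_train, n_frames = 8, 4, 16, 50, 2
W0 = rng.normal(size=(d_y, d_z)) / np.sqrt(d_z)
U = rng.normal(size=(d_y, d_e)) / np.sqrt(d_e)
train_latents = rng.normal(size=(n_train, d_z))
expr_codes = rng.normal(size=(n_frames, d_e))
targets = rng.normal(size=(n_frames, d_y))
z, W, delta, history = adapt_few_shot(W0, U, train_latents, expr_codes, targets)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

In the actual method the observations are few-shot images compared against volume renderings, and camera calibration is optimized alongside the 3DMM fit; both are left out here to keep the gradients readable.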

Paper Structure

This paper contains 23 sections, 4 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: We present a new approach to generate an animatable, photo-realistic avatar from only a few images, or even a single image, of the target person. We encode geometry and appearance in a neural radiance field produced by a 3D generative model learned from multi-view, multi-expression data. With this model, we can render a high-fidelity head avatar from novel views.
  • Figure 2: We adopt a 3DMM-anchored neural radiance field (NeRF) as our avatar representation, where the feature of each query point is aggregated from its k nearest neighbors among the 3DMM vertices and decoded to color and density by a shallow MLP network. The identity branch, built on a StyleGAN2 generator [stylegan2], encodes personalized characteristics into an identity feature map from a latent code uniquely assigned to each training subject. The expression branch produces an expression feature map from the 3DMM expression code via a U-Net. The sum of the two feature maps is then sampled at the 3DMM vertices via their texture coordinates to establish the 3DMM-anchored neural radiance field. For few-shot adaptation, we initialize the target latent code with the mean latent code across training subjects and optimize it jointly with the model weights and per-frame 3DMM fitting corrections on the given images of the target person.
  • Figure 3: New-subject adaptation of different methods with varying percentages of training data. Our proposed method consistently outperforms state-of-the-art approaches, particularly in the low-data regime (e.g., a single image), and maintains its advantage as the amount of training data increases.
  • Figure 4: Cross-method comparison as a function of training time (left), and on test images grouped into five bins, where images in higher-index bins have expressions less similar to the training image (right).
  • Figure 5: Avatar generation for novel identities by interpolating the latent codes of two subjects. We compare the proposed approach with a tri-plane backbone (TP) and with a model trained on a single view (SV); our approach produces higher-quality renderings and smoother interpolations.
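The Figure 2 pipeline, from UV feature maps to per-point color and density, can be sketched in NumPy. This is a minimal sketch under stated assumptions. The identity and expression feature maps are random stand-ins rather than StyleGAN2 and U-Net outputs. Vertex features come from bilinear sampling of the summed map at per-vertex texture coordinates. The k nearest vertex features are combined with inverse-distance weights; the caption says only that features are aggregated from the k nearest neighbors, so this weighting is an assumption. Finally, `shallow_mlp` uses random, untrained weights with a sigmoid for color and a softplus for density. The function names `sample_uv`, `knn_aggregate`, and `shallow_mlp` are hypothetical.

```python
import numpy as np


def sample_uv(feature_map, uv):
    """Bilinearly sample an (H, W, C) feature map at per-vertex UVs in [0, 1]^2."""
    H, W, _ = feature_map.shape
    uv = np.clip(uv, 0.0, 1.0)
    x = uv[:, 0] * (W - 1)
    y = uv[:, 1] * (H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feature_map[y0, x0] * (1 - wx) + feature_map[y0, x1] * wx
    bottom = feature_map[y1, x0] * (1 - wx) + feature_map[y1, x1] * wx
    return top * (1 - wy) + bottom * wy          # (V, C)


def knn_aggregate(query, verts, vert_feats, k=8, eps=1e-8):
    """Inverse-distance-weighted mean of the features of the k nearest vertices."""
    d2 = np.sum((query[:, None, :] - verts[None, :, :]) ** 2, axis=-1)  # (Q, V)
    idx = np.argsort(d2, axis=1)[:, :k]                                  # (Q, k)
    dist = np.sqrt(np.take_along_axis(d2, idx, axis=1))                  # (Q, k)
    w = 1.0 / (dist + eps)
    w /= w.sum(axis=1, keepdims=True)                                    # weights sum to 1
    return np.einsum("qk,qkc->qc", w, vert_feats[idx])                  # (Q, C)


def shallow_mlp(x, params):
    """Two-layer MLP decoding features to RGB in (0, 1) and density >= 0."""
    W1, b1, W2, b2 = params
    h = np.maximum(x @ W1 + b1, 0.0)             # ReLU hidden layer
    out = h @ W2 + b2                            # (Q, 4)
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))      # sigmoid for color
    sigma = np.logaddexp(0.0, out[:, 3])         # softplus for density
    return rgb, sigma


# Demo with random (untrained) inputs.
rng = np.random.default_rng(0)
C, V, Q = 16, 200, 5
identity_map = rng.normal(size=(32, 32, C))      # stand-in for the identity branch output
expression_map = rng.normal(size=(32, 32, C))    # stand-in for the expression branch output
uv = rng.uniform(size=(V, 2))                    # hypothetical per-vertex texture coordinates
verts = rng.uniform(-1.0, 1.0, size=(V, 3))      # stand-in for posed 3DMM vertices
vert_feats = sample_uv(identity_map + expression_map, uv)
queries = rng.uniform(-1.0, 1.0, size=(Q, 3))
feats = knn_aggregate(queries, verts, vert_feats, k=8)
params = (rng.normal(size=(C, 32)) * 0.1, np.zeros(32),
          rng.normal(size=(32, 4)) * 0.1, np.zeros(4))
rgb, sigma = shallow_mlp(feats, params)
print(rgb.shape, sigma.shape)
```

The brute-force (Q, V) distance matrix is fine for a demo; at realistic sample counts a spatial index such as a k-d tree would be the usual replacement.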