LightAvatar: Efficient Head Avatar as Dynamic Neural Light Field

Huan Wang, Feitong Tan, Ziqian Bai, Yinda Zhang, Shichen Liu, Qiangeng Xu, Menglei Chai, Anish Prabhu, Rohit Pandey, Sean Fanello, Zeng Huang, Yun Fu

TL;DR

LightAvatar is the first head avatar model based on neural light fields (NeLFs). It can achieve new SOTA image quality quantitatively or qualitatively while being significantly faster than its counterparts, reporting 174.1 FPS on a consumer-grade GPU with no customized optimization.

Abstract

Recent works have shown that neural radiance fields (NeRFs) built on top of parametric models reach SOTA quality for building photorealistic head avatars from a monocular video. However, one major limitation of NeRF-based avatars is their slow rendering speed, caused by the dense point sampling of NeRF, which prevents broader use on resource-constrained devices. We introduce LightAvatar, the first head avatar model based on neural light fields (NeLFs). LightAvatar renders an image from 3DMM parameters and a camera pose via a single network forward pass, without using mesh or volume rendering. The proposed approach, while conceptually appealing, poses significant challenges for real-time efficiency and training stability. To address them, we introduce dedicated network designs that obtain proper representations for the NeLF model while maintaining a low FLOPs budget. Meanwhile, we adopt a distillation-based training strategy that uses a pretrained avatar model as a teacher to synthesize abundant pseudo data for training. A warping field network is introduced to correct the fitting error in the real data so that the model can learn better. Extensive experiments suggest that our method can achieve new SOTA image quality quantitatively or qualitatively, while being significantly faster than its counterparts, reporting 174.1 FPS (512x512 resolution) on a consumer-grade GPU (RTX3090) with no customized optimization.
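
To put the reported throughput in perspective, the short calculation below converts the abstract's figures (174.1 FPS at 512x512) into a per-frame time budget and a pixel rate. It is plain arithmetic on the numbers quoted above; the variable names are illustrative and nothing here comes from the authors' code.

```python
# Figures quoted in the abstract.
FPS = 174.1
WIDTH = HEIGHT = 512

# Time available to render one frame, in milliseconds.
frame_ms = 1000.0 / FPS

# Output pixels produced per frame and per second.
pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = FPS * pixels_per_frame

print(f"{frame_ms:.2f} ms per frame")                       # 5.74 ms per frame
print(f"{pixels_per_second / 1e6:.1f} M pixels per second")  # 45.6 M pixels per second
```

In other words, the whole pipeline has under 6 ms per frame, which is the constraint behind the low FLOPs budget mentioned in the abstract.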

Paper Structure

This paper contains 25 sections, 10 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: (a) Overview comparison between existing neural head avatars (top) and our LightAvatar (bottom), a brand-new framework for building efficient 3D head avatars based on neural light fields. LightAvatar features a simple and uniform design: it takes an expression code and a camera pose as input and renders the RGB image via a single network forward pass, running at 174.1 FPS (on an RTX3090 GPU) with improved image quality. (b) FPS and LPIPS comparison of recent top-performing (fast) avatars. Our method achieves much faster rendering speed with better LPIPS than its counterparts.
  • Figure 2: Overview of our LightAvatar. The method consists of four trainable networks (spatial attention network, local feature network, NeLF network, and SR network). (1) Given an expression code, the local feature network transforms it into a local feature bank, which stores the features of different local head regions. (2) Given a specific ray and the expression code, the spatial attention network outputs a vector of spatial attention weights that query the local feature bank to obtain the expression representation for that ray. (3) Then, given the ray and the expression representation as input, the NeLF model predicts the desired (low-resolution) RGB. (4) Finally, the SR network generates a high-resolution image from the low-resolution image. Notably, LightAvatar predicts the target RGB via a single network forward pass, thus enabling fast rendering.
  • Figure 3: Visual comparison on the test set with prior top-performing monocular head avatars. From top to bottom, the subjects are Subject0 to Subject4 in order (see another 3 in the supplementary material). Our LightAvatar method faithfully predicts the facial expressions and presents sharper high-frequency details than other approaches.
  • Figure 4: Visual comparison with recent fast avatars. NBS: NeRFBlendShape [gao2022reconstructing]; PA: PointAvatar [zheng2023pointavatar]; INSTA [zielonka2023instant]. Top to bottom: Subject 8 to Subject 12.
  • Figure 5: (a) Results of our method on Subject13 with the shoulder. For reference, the average LPIPS/SSIM/PSNR of MonoAvatar on the test set: 0.118/0.846/26.11; ours: 0.107/0.849/26.29. (b) Comparison between joint modeling and separate modeling (Sec. \ref{subsec:lightavatar}) when learning the shoulder in our method.
  • ...and 6 more figures
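
The four-stage data flow described in the Figure 2 caption can be sketched as a shape-level toy in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: every dimension, the 6-D ray encoding, the single random linear layer standing in for each trained network, and the nearest-neighbour upsampling used in place of the SR network are all hypothetical choices made so the example runs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only (not taken from the paper).
EXPR_DIM, NUM_REGIONS, FEAT_DIM, RAY_DIM = 64, 8, 16, 6
LOW_RES, SR_FACTOR = 32, 4

def linear(in_dim, out_dim):
    """Random, untrained weights standing in for a trained network."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

W_local = linear(EXPR_DIM, NUM_REGIONS * FEAT_DIM)  # local feature network
W_attn = linear(RAY_DIM + EXPR_DIM, NUM_REGIONS)    # spatial attention network
W_nelf = linear(RAY_DIM + FEAT_DIM, 3)              # NeLF network

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def render(expr, rays):
    """expr: (EXPR_DIM,), rays: (N, RAY_DIM) with N = LOW_RES * LOW_RES."""
    n = rays.shape[0]
    # (1) Expression code -> local feature bank, one feature row per head region.
    bank = (expr @ W_local).reshape(NUM_REGIONS, FEAT_DIM)
    # (2) Ray + expression code -> attention weights over regions,
    #     then query the bank for a per-ray expression representation.
    expr_tiled = np.broadcast_to(expr, (n, EXPR_DIM))
    attn = softmax(np.concatenate([rays, expr_tiled], axis=1) @ W_attn)
    rep = attn @ bank                                 # (N, FEAT_DIM)
    # (3) Ray + expression representation -> low-resolution RGB in [0, 1].
    logits = np.concatenate([rays, rep], axis=1) @ W_nelf
    low = (1.0 / (1.0 + np.exp(-logits))).reshape(LOW_RES, LOW_RES, 3)
    # (4) Placeholder for the SR network: nearest-neighbour upsampling.
    high = low.repeat(SR_FACTOR, axis=0).repeat(SR_FACTOR, axis=1)
    return attn, low, high

expr = rng.standard_normal(EXPR_DIM)                     # stand-in expression code
rays = rng.standard_normal((LOW_RES * LOW_RES, RAY_DIM)) # stand-in ray encodings
attn, low, high = render(expr, rays)
print(low.shape, high.shape)  # (32, 32, 3) (128, 128, 3)
```

The point of the sketch is that each pixel's colour comes from one pass through the stack, with no per-ray point sampling or volume integration, which is what the caption credits for the fast rendering.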