PAV: Personalized Head Avatar from Unstructured Video Collection

Akin Caliskan, Berkay Kicanaoglu, Hyeongwoo Kim

TL;DR

PAV addresses the challenge of building a personalized head avatar from unstructured monocular videos that show the same subject with multiple appearances. It introduces a single unified dynamic deformable NeRF conditioned on per-appearance latent features attached to a geometry-aware head mesh, using a shared canonical space and appearance-conditioned density and color fields. The approach uses a FLAME-based head model for geometry, learns latent appearance embeddings $Z_j$, and employs a density offset $\Delta_{\sigma}$ to capture appearance-specific geometry and texture. It renders novel poses and novel expressions across appearances with higher visual quality than the baseline method. This enables realistic, controllable head avatars from unconstrained videos, with implications for telepresence and animation. The authors acknowledge limitations in scaling to multiple identities and ethical concerns around misuse.

Abstract

We propose PAV (Personalized Head Avatar) for the synthesis of human faces under arbitrary viewpoints and facial expressions. PAV introduces a method that learns a dynamic deformable neural radiance field (NeRF) from a collection of monocular talking face videos of the same character under various appearance and shape changes. Unlike existing head NeRF methods, which are limited to modeling such input videos on a per-appearance basis, our method learns multi-appearance NeRFs by introducing an appearance embedding for each input video via learnable latent neural features attached to the underlying geometry. Furthermore, the proposed appearance-conditioned density formulation accommodates shape variation of the character, such as facial hair and soft tissues, in the radiance field prediction. To the best of our knowledge, our approach is the first dynamic deformable NeRF framework to model appearance and shape variations in a single unified network for multiple appearances of the same subject. We demonstrate experimentally that PAV outperforms the baseline method in terms of visual rendering quality in our quantitative and qualitative studies on various subjects.
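The appearance-conditioned density idea can be illustrated with a minimal sketch. This is not the authors' implementation: the additive combination of a shared density with an offset, the softplus activation, the tiny random networks, and all names and sizes (`Z`, `LATENT_DIM`, `shared_density`, `density_offset`, `appearance_density`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_APPEARANCES = 3   # hypothetical: one embedding per input video
LATENT_DIM = 8        # hypothetical size of each appearance embedding Z_j
HIDDEN = 16           # hypothetical hidden width

# One latent vector per appearance (random here; learned during training in practice).
Z = rng.normal(size=(NUM_APPEARANCES, LATENT_DIM))

# Tiny random weights standing in for trained networks.
W_shared = rng.normal(size=(3, HIDDEN)) * 0.5
w_sigma = rng.normal(size=HIDDEN) * 0.5
W_offset = rng.normal(size=(3 + LATENT_DIM, HIDDEN)) * 0.5
w_delta = rng.normal(size=HIDDEN) * 0.5


def shared_density(x):
    """Appearance-independent raw density for canonical points x of shape (N, 3)."""
    return np.tanh(x @ W_shared) @ w_sigma


def density_offset(x, z):
    """Appearance-specific raw offset from points x (N, 3) and one embedding z."""
    z_tiled = np.broadcast_to(z, (x.shape[0], z.shape[0]))
    inp = np.concatenate([x, z_tiled], axis=1)
    return np.tanh(inp @ W_offset) @ w_delta


def appearance_density(x, j):
    """Density for appearance j: softplus(shared + offset), an assumed combination."""
    raw = shared_density(x) + density_offset(x, Z[j])
    return np.logaddexp(0.0, raw)  # softplus keeps the density non-negative
```

Under these assumptions, the same canonical points yield different densities for different appearance indices `j`, which is how a single network could represent, for example, a beard in one video and a clean-shaven face in another.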

Paper Structure (22 sections, 6 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: We present an approach for learning personalized head avatars from an unstructured video collection of the same subject. The videos are taken in different environments, with diverse appearances and possible shape changes of a character, as shown on the left. Our method leverages all of the input clips into a single unified dynamic deformable NeRF framework and offers high-quality video replays in novel head poses and facial expressions. On the right, we show some rendering results of each input appearance of the same subject aligned in head pose and facial expression that are not seen during training. Notably, our method can synthesize identity-specific details while showing coherent expressions across different appearances.
  • Figure 2: Overview of the proposed approach, PAV. (a) shows some training videos in various appearances of the same subject. (b) We estimate the 3D head geometry of the person using the FLAME head model. (c) shows the proposed personalized deformable neural radiance field from the video collection. (d) We learn appearance-specific neural features and attach them to vertices; this feature is passed to fully fused multi-layer perceptrons with additional conditioning on the facial expressions $e$ and the encoded view direction $d$.
  • Figure 3: Qualitative Ablation Study on VidCol Dataset. This figure shows synthesized images and pixel error maps against ground truth. Brighter pixel colors denote higher pixel errors. Without the appearance embedding with latent neural features and the appearance-conditioned density offset (a), the model fails to learn appearance-specific details. The latent neural features (LNF) (c) enable high-fidelity rendering of texture (blue arrow). The appearance-conditioned density offset field (d) allows the model to learn more accurate geometry and texture for the face, hair, and neck regions (red arrow).
  • Figure 4: Qualitative comparison of our method (PAV) against the baseline method (Insta [zielonka2023instant]). Columns 2 and 6 show the target head poses and facial expressions for the synthesized images in columns 3, 4, 7, and 8.
  • Figure 5: Qualitative comparison of PAV against Insta [zielonka2023instant] trained on a single appearance. The reference image is for a photoconsistency check only, since there is no ground-truth image for that pose/expression.
  • ...and 1 more figures
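
The conditioning described in the Figure 2 caption (per-appearance features attached to mesh vertices, combined with the facial expression $e$ and the encoded view direction $d$) can be sketched as below. This is a hedged illustration, not the paper's pipeline: the nearest-vertex lookup, the sin/cos frequency encoding, the random stand-in mesh, and all names and sizes are assumptions, and the fully fused MLP that would consume the result is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical.
NUM_VERTS, FEAT_DIM, EXPR_DIM, NUM_FREQS = 100, 4, 10, 2
NUM_APPEARANCES = 2

# Random stand-in for head-mesh vertices (a real system would use a FLAME mesh).
vertices = rng.normal(size=(NUM_VERTS, 3))
# One feature vector per vertex and per appearance (learned in practice).
features = rng.normal(size=(NUM_APPEARANCES, NUM_VERTS, FEAT_DIM))


def encode_direction(d, num_freqs=NUM_FREQS):
    """Sin/cos frequency encoding of view directions d with shape (N, 3)."""
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    scaled = d[:, None, :] * freqs[None, :, None]       # (N, F, 3)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, F, 6)
    return enc.reshape(d.shape[0], -1)                  # (N, 6 * F)


def nearest_vertex_feature(x, j):
    """Feature of the vertex closest to each query point x (N, 3) for appearance j."""
    dists = np.linalg.norm(x[:, None, :] - vertices[None, :, :], axis=-1)  # (N, V)
    return features[j, dists.argmin(axis=1)]            # (N, FEAT_DIM)


def build_mlp_input(x, d, e, j):
    """Concatenate vertex feature, expression code e (EXPR_DIM,), and encoded direction."""
    f = nearest_vertex_feature(x, j)
    e_tiled = np.broadcast_to(e, (x.shape[0], e.shape[0]))
    return np.concatenate([f, e_tiled, encode_direction(d)], axis=1)
```

Switching the appearance index `j` changes only the vertex features, while the expression and view-direction inputs stay the same, which matches the idea of coherent expressions across different appearances.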