Tri$^{2}$-plane: Thinking Head Avatar via Feature Pyramid

Luchuan Song, Pinxin Liu, Lele Chen, Guojun Yin, Chenliang Xu

TL;DR

Tri$^2$-plane tackles the loss of high-frequency detail in monocular head avatar reconstruction by introducing a multi-scale feature-pyramid framework built on three cascaded tri-planes, enabling progressive global-to-local refinement of facial features. A geometry-aware sliding window augments training to improve robustness across arbitrary camera viewpoints and cross-identity reenactment, complemented by a super-resolution module. Quantitative and qualitative evaluations show the approach outperforms state-of-the-art methods on self-/cross-reenactment tasks in metrics such as F-LMD, SD, PSNR, and LPIPS, with superior texture and hair detail preservation. The method offers a practical, plug-in augmentation for NeRF-based facial reconstruction and strengthens the realism and consistency of monocular head avatars, while acknowledging limitations and emphasizing responsible deployment.

Abstract

Recent years have witnessed considerable achievements in facial avatar reconstruction with neural volume rendering. Despite these advances, reconstructing complex, dynamic head movements from monocular videos still struggles to capture and restore fine-grained details. In this work, we propose a novel approach, named Tri$^2$-plane, for monocular photo-realistic volumetric head avatar reconstruction. Unlike existing works that rely on a single tri-plane deformation field for dynamic facial modeling, Tri$^2$-plane applies the feature-pyramid principle, using three tri-planes linked by top-down lateral connections to improve detail. It samples and renders facial details at multiple scales, moving from the entire face to specific local regions and then to even more refined sub-regions. Moreover, we incorporate a camera-based geometry-aware sliding window method as a training augmentation, which improves robustness beyond the canonical space, particularly for cross-identity generation. Experiments indicate that Tri$^2$-plane outperforms existing methods in both quantitative and qualitative assessments. Project website: https://songluchuan.github.io/Tri2Plane.github.io/
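
The multi-scale idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes three randomly initialised tri-planes at hypothetical resolutions (64, 128, 256) that all cover the same cube $[-1, 1]^3$, and it assumes the levels are fused by simple coarse-to-fine addition. The paper's actual region cropping, lateral connections, and shared MLP decoder are not reproduced here, and all function names are illustrative.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample an (R, R, C) feature plane at (u, v) in [-1, 1]."""
    res = plane.shape[0]
    # Map normalised coordinates to the continuous index range [0, res - 1].
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x0 + 1]
    bottom = (1 - fx) * plane[y0 + 1, x0] + fx * plane[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def sample_triplane(planes, point):
    """Sum the features read from the XY, XZ and YZ planes at a 3D point."""
    x, y, z = point
    return (bilinear_sample(planes["xy"], x, y)
            + bilinear_sample(planes["xz"], x, z)
            + bilinear_sample(planes["yz"], y, z))

def make_triplane(res, channels, rng):
    """Create one tri-plane with random features (stand-in for learned ones)."""
    return {k: rng.standard_normal((res, res, channels))
            for k in ("xy", "xz", "yz")}

def pyramid_features(pyramid, point):
    """Assumed fusion rule: accumulate features from coarse to fine levels."""
    fused = np.zeros(pyramid[0]["xy"].shape[-1])
    for planes in pyramid:  # ordered coarse -> fine
        fused = fused + sample_triplane(planes, point)
    return fused

rng = np.random.default_rng(0)
# Hypothetical resolutions; the paper's actual plane sizes may differ.
pyramid = [make_triplane(res, 8, rng) for res in (64, 128, 256)]
feat = pyramid_features(pyramid, (0.1, -0.3, 0.5))
print(feat.shape)
```

In a full model, the fused feature vector would be decoded by an MLP into density and colour for volume rendering; that step is omitted from this sketch.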

Paper Structure (18 sections, 17 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 1: Overview of Tri$^2$-plane. The pipeline has four components: (1) parametric facial tracking and zero-pose rendering generate mean texture and normal maps (shown as Front-View); (2) a facial condition embedding is built from the inputs ($\beta_{t}$, $\gamma_{t}$ and encoded $\mathbf{I}_{t}$); (3) multiple tri-planes perform voxel rendering (the Tri$^2$-plane), covering various facial scales with shared MLP weights; and (4) the resulting images are refined with a super-resolution model (not depicted in the figure). In addition, a geometry-aware sliding window augments the training data to improve robustness; it combines the camera parameters ($\mathbf{c}_{I}$, $\mathbf{c}_{E}$) with the tracked translation values to form the training pair.
  • Figure 2: Qualitative comparison of different methods on the front view of videos from NeRSemble [kirschstein2023nersemble] under the self-reenactment task. From left to right: HAvatar [zhao2023havatar], PointAvatar [Zheng_2023_CVPR], GaussianHead [wang2024gaussianhead] and Ours. Our method achieves high-quality reconstruction of details such as hair and torso textures. Please zoom in for details.
  • Figure 3: Qualitative comparison on cross-reenactment. The subjects are collected from (top to bottom): NeRSemble [kirschstein2023nersemble], HAvatar [zhao2023havatar], PointAvatar [Zheng_2023_CVPR], GaussianHead [wang2024gaussianhead] and a self-recorded video. GT is the corresponding frame in the actor video. We use the official weights (HAvatar [zhao2023havatar], PointAvatar [Zheng_2023_CVPR]) for the subjects in the second and third rows.
  • Figure 4: Visualization of different numbers of tri-planes. Each pair shows two novel-view reconstructions of the same facial pose/expression. The areas of interest are zoomed in (red arrows). Ours (w/ $\Phi_{512}$) exhibits the best details, as evidenced by the glass textures (w/ $\Phi_{512}$ has three slices; the others have two). The nearest neighbor, taken from the dataset, shows the clearest ground truth of the glass texture.
  • Figure 5: Visualization of the improvement from the geometry-aware sliding window. We move the camera to sample the canonical appearance from different viewpoints. The sliding window helps prevent artifacts across different views and movements.
  • ...and 3 more figures
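
The Figure 1 caption says the sliding window pairs camera parameters with tracked translations, but it does not spell out the mechanics. As a hedged illustration only, the sketch below shows one standard way to keep a cropped window geometrically consistent with its camera: shifting the principal point of a pinhole intrinsic matrix by the crop offset. Treating $\mathbf{c}_{I}$ as a 3x3 pinhole matrix `K`, and the helper names `slide_window` and `project`, are assumptions made for this example, not the paper's procedure.

```python
import numpy as np

def project(K, point_cam):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    uvw = K @ np.asarray(point_cam, dtype=float)
    return uvw[:2] / uvw[2]

def slide_window(image, K, offset, size):
    """Crop a size x size window at pixel offset (dx, dy).

    The principal point is shifted by the same offset so that a 3D point
    projects to the same image content in the cropped frame.
    """
    dx, dy = offset
    crop = image[dy:dy + size, dx:dx + size]
    K_new = K.astype(float).copy()
    K_new[0, 2] -= dx  # shift principal point along x
    K_new[1, 2] -= dy  # shift principal point along y
    return crop, K_new

# Hypothetical camera and image, used only for illustration.
K = np.array([[500.0, 0.0, 256.0],
              [0.0, 500.0, 256.0],
              [0.0, 0.0, 1.0]])
image = np.zeros((512, 512, 3))
crop, K_crop = slide_window(image, K, offset=(32, 48), size=256)

point = (0.1, -0.2, 2.0)
uv_full = project(K, point)       # pixel in the full frame
uv_crop = project(K_crop, point)  # pixel in the cropped window
print(uv_full - uv_crop)
```

Under this assumption the camera extrinsics are unchanged by the crop; only the intrinsics move with the window.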