NPGA: Neural Parametric Gaussian Avatars

Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner

TL;DR

NPGA builds high-fidelity, controllable 3D head avatars from multi-view video by combining 3D Gaussian Splatting with a neural parametric head model prior (MonoNPHM). It uses a cycle-consistency distillation to convert MonoNPHM's backward deformation field into a forward field compatible with rasterization, and augments the canonical Gaussian representation with per-Gaussian features to increase dynamic expressivity. Laplacian regularization and an adaptive density control strategy stabilize training and allow detailed regions such as the eyes and teeth to be rendered faithfully. On NeRSemble, NPGA outperforms prior avatars on self-reenactment by 2.6 dB PSNR. The avatars can also be animated accurately from monocular video, which suggests they remain usable outside multi-view capture studios.
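
To put the 2.6 PSNR gain in perspective: PSNR is a logarithmic measure of mean squared error, reported in decibels, so a fixed gain corresponds to a fixed ratio of errors. The worked reading below uses the standard definition, with $\mathrm{MAX}$ denoting the peak pixel value. It treats the gain as if it applied to a single image pair, which is a simplifying assumption, since reported PSNR values are typically averaged over many frames.

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\text{prior}}}{\mathrm{MSE}_{\text{NPGA}}}
= 10^{\Delta\mathrm{PSNR}/10}
= 10^{0.26}
\approx 1.82
```

Under that assumption, a 2.6 dB improvement corresponds to roughly 45% lower mean squared error.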

Abstract

The creation of high-fidelity, digital versions of human heads is an important stepping stone in the process of further integrating virtual components into our everyday lives. Constructing such avatars is a challenging research problem, due to a high demand for photo-realism and real-time rendering performance. In this work, we propose Neural Parametric Gaussian Avatars (NPGA), a data-driven approach to create high-fidelity, controllable avatars from multi-view video recordings. We build our method around 3D Gaussian splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds. In contrast to previous work, we condition our avatars' dynamics on the rich expression space of neural parametric head models (NPHM), instead of mesh-based 3DMMs. To this end, we distill the backward deformation field of our underlying NPHM into forward deformations which are compatible with rasterization-based rendering. All remaining fine-scale, expression-dependent details are learned from the multi-view videos. For increased representational capacity of our avatars, we propose per-Gaussian latent features that condition each primitive's dynamic behavior. To regularize this increased dynamic expressivity, we propose Laplacian terms on the latent features and predicted dynamics. We evaluate our method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by 2.6 PSNR. Furthermore, we demonstrate accurate animation capabilities from real-world monocular videos.
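
The distillation of a backward deformation field into a forward one can be illustrated with a toy cycle-consistency objective. The sketch below is not the authors' implementation: both fields are hypothetical analytic stand-ins for neural networks, all function names are invented for illustration, and the loss (map canonical points forward, map them back, penalize the round-trip error) is one plausible form of such a cycle loss. A finite-difference gradient descent replaces a real optimizer.

```python
def backward_field(x_posed, z_exp):
    """Toy stand-in for a backward deformation B (posed -> canonical offset).
    Assumption: a uniform shift along x, scaled by a scalar expression code."""
    return (-z_exp, 0.0, 0.0)


def make_forward_field(weight):
    """Toy forward deformation F (canonical -> posed offset) with one
    learnable scalar `weight`. A real model would be a neural network."""
    def forward_field(x_canonical, z_exp):
        return (weight * z_exp, 0.0, 0.0)
    return forward_field


def add(point, offset):
    """Component-wise sum of a 3D point and a 3D offset."""
    return tuple(p + o for p, o in zip(point, offset))


def cycle_loss(forward_field, points_canonical, z_exp):
    """Mean squared round-trip error between x_c and B(F(x_c))."""
    total = 0.0
    for x_c in points_canonical:
        x_p = add(x_c, forward_field(x_c, z_exp))        # canonical -> posed
        x_back = add(x_p, backward_field(x_p, z_exp))    # posed -> canonical
        total += sum((a - b) ** 2 for a, b in zip(x_back, x_c))
    return total / len(points_canonical)


def fit_weight(points, expressions, steps=200, lr=0.1, eps=1e-4):
    """Fit the forward field's weight by gradient descent on the cycle loss,
    using central finite differences in place of automatic differentiation."""
    def mean_loss(w):
        field = make_forward_field(w)
        losses = [cycle_loss(field, points, z) for z in expressions]
        return sum(losses) / len(losses)

    w = 0.0
    for _ in range(steps):
        grad = (mean_loss(w + eps) - mean_loss(w - eps)) / (2 * eps)
        w -= lr * grad
    return w
```

In this toy setup the backward field shifts points by `-z_exp`, so the forward field that closes the cycle has `weight` equal to 1, and `fit_weight` converges toward that value. The point of the construction is that the forward field is never supervised directly; it is only required to invert the given backward field.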

Paper Structure (38 sections, 13 equations, 7 figures, 2 tables)

This paper contains 38 sections, 13 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Method Overview: Our avatar optimization is based on multi-view video recordings and their MonoNPHM tracking, see (a). Next, we extract a forward-deformation prior $\mathcal{F}$ from MonoNPHM's backward deformation field $\mathcal{B}$ using a cycle-consistency loss, see (b). Our avatars consist of a canonical Gaussian point cloud (c), which is warped into posed space using our dynamics module $\mathcal{D}$, consisting of the coarse pre-trained component $\mathcal{F}$ and a detail network $\mathcal{G}$. We condition both networks on per-Gaussian features, which dictate each primitive's behavior. After rendering the avatar with 3DGS, we employ a screen-space CNN to suppress small-scale artifacts.
  • Figure 2: Self-Reenactment: Qualitative comparison of different methods on the held-out sequence.
  • Figure 3: Cross-Reenactment: Qualitative comparison of transferring a driving expression from a different identity (left) to an avatar.
  • Figure 4: Ablation Study: Without utilizing per-Gaussian features ("Vanilla"), the avatars fail to represent fine expression details and complicated regions like the eyes and bottom teeth. Adding per-Gaussian features (p.G.F.) results in significantly sharper reconstructions but is prone to artifacts under extreme expressions. Adding our Laplacian regularization ("+Lap. smoothness") and a screen-space CNN ("Ours") finally resolves all artifacts. Furthermore, "Ours-ADC" demonstrates that the default densification strategy inhibits detailed reconstructions.
  • Figure 5: Real-World Application: We utilize the monocular RGB tracking from MonoNPHM to animate our high-fidelity avatars, demonstrating the applicability of our avatars outside of multi-view capture studios.
  • ...and 2 more figures
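
The Laplacian smoothness term mentioned in the Figure 4 caption can be sketched as a penalty on how far each Gaussian's feature vector lies from the average of its spatial neighbours. This is a minimal illustration under stated assumptions, not the authors' code: the brute-force k-nearest-neighbour graph, the uniform neighbour weights, and the function names are all hypothetical choices, and the paper's exact neighbourhood and weighting may differ.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbours of each point, excluding the point itself."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def laplacian_smoothness(values, neighbours):
    """Mean squared difference between each vector and the uniform average
    of its neighbours' vectors (a simple graph-Laplacian penalty)."""
    total = 0.0
    for i, nbrs in enumerate(neighbours):
        dim = len(values[i])
        mean = [sum(values[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        total += sum((values[i][d] - mean[d]) ** 2 for d in range(dim))
    return total / len(values)
```

Here `points` would hold canonical Gaussian centers and `values` either the per-Gaussian latent features or the predicted per-Gaussian offsets, so the same function can regularize both quantities. The penalty is zero when neighbouring Gaussians share identical values and grows when an isolated Gaussian deviates from its surroundings, which is the behaviour one would want for suppressing artifacts under extreme expressions.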