SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians

Liam Schoneveld, Zhe Chen, Davide Davoli, Jiapeng Tang, Saimon Terazawa, Ko Nishino, Matthias Nießner

TL;DR

This work tackles real-time 3D head reconstruction from monocular imagery without 3D supervision. It introduces SHeaP, which jointly predicts 3DMM parameters and a set of Gaussians bound to the mesh, using 2D Gaussian Splatting to enable strong photometric supervision. The method employs a UV-map generator and a graph CNN to densify/prune Gaussians, a refined binding between Gaussians and the 3DMM, and a lighting/shading model based on SH priors, all guided by self-supervised losses that couple geometry to appearance. Results show state-of-the-art performance among 2D-supervised methods on NoW and NeRSemble, with improved expressive geometry and emotion-prediction capabilities on AffectNet, highlighting the practical impact for scalable, expressive head avatars.

Abstract

Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.
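The idea of Gaussians that are "rigged" to a mesh and then reanimated can be illustrated with a minimal sketch. It assumes a simple binding in which each point is stored as barycentric coordinates on a parent triangle plus an offset along that triangle's normal; this is a common way to attach points to a mesh and is not necessarily the binding used by SHeaP. The function name `place_bound_points` and all of its parameters are hypothetical.

```python
import numpy as np

def place_bound_points(vertices, faces, parent, bary, normal_offset):
    """Place points that are bound to mesh triangles (illustrative assumption).

    vertices:      (V, 3) mesh vertex positions
    faces:         (F, 3) vertex indices per triangle
    parent:        (N,)   index of the parent triangle of each point
    bary:          (N, 3) barycentric coordinates on the parent triangle
    normal_offset: (N,)   signed offset along the parent triangle's normal
    """
    vertices = np.asarray(vertices, dtype=float)
    faces = np.asarray(faces)
    parent = np.asarray(parent)
    bary = np.asarray(bary, dtype=float)
    normal_offset = np.asarray(normal_offset, dtype=float)

    tri = vertices[faces[parent]]                      # (N, 3, 3) corners per point
    normal = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    normal /= np.linalg.norm(normal, axis=1, keepdims=True)

    on_surface = (bary[:, :, None] * tri).sum(axis=1)  # barycentric interpolation
    return on_surface + normal_offset[:, None] * normal

# The same binding evaluated on a "source" mesh and a moved "target" mesh.
faces = np.array([[0, 1, 2]])
source_vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
target_vertices = source_vertices + np.array([0.0, 0.0, 2.0])

parent = np.array([0, 0])
bary = np.array([[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0]])
normal_offset = np.array([0.1, 0.0])

source_points = place_bound_points(source_vertices, faces, parent, bary, normal_offset)
target_points = place_bound_points(target_vertices, faces, parent, bary, normal_offset)
```

Because the binding is stored relative to the triangles, changing only the mesh vertices moves the bound points with the surface, which is what lets an avatar predicted from a source frame be re-posed to match a target frame.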

Paper Structure

This paper contains 30 sections, 17 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Our method, named SHeaP, instantly predicts accurate human head geometry from a single image. First row: input images; second row: predicted meshes; bottom row: rendered predicted Gaussians.
  • Figure 2: Overview of SHeaP. At each training step, we sample a source image $I_\textit{source}$ and a target image $I_\textit{target}$. Both are passed through the same vision transformer (ViT), which predicts the 3DMM parameters (shape $\bm{\beta}$, pose $\bm{\theta}$, and expression $\bm{\psi}$), plus an environment lighting latent $\bm{\ell}$ and identity features $\bm{f}$. A Gaussians Regressor takes $\bm{f}$ as input, along with DINOv2 features $\mathbf{d}$, and predicts a set of Gaussians $\mathcal{G}$, which are bound to the predicted 3DMM mesh and rendered with 2DGS to produce $\hat{I}_\textit{target}$. Finally, photometric losses between $\hat{I}_\textit{target}$ and $I_\textit{target}$, together with additional losses based on rendered depth, normals, and landmarks, are backpropagated to the ViT and Gaussians Regressor parameters.
  • Figure 3: Architecture of the Gaussians generator. In the illustrated case, the first two Gaussians have the same parent face: $p_1 = p_2$ and thus their learned embeddings $\bm{e}_1, \bm{e}_2$ are concatenated with the same region features, $\mathbf{r}_1$.
  • Figure 4: Comparison to other one-shot reconstruction methods, from left to right: Ours, AlbedoGAN, SMIRK, EMOCA, DECA.
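
The lookup described in the Figure 3 caption, where each Gaussian's learned embedding is concatenated with the region features of its parent face, can be sketched as a gather followed by a concatenation. This is only an illustration under assumed shapes: the function name `build_gaussian_inputs`, the feature dimensions, and the zero-based indexing are hypothetical and not taken from the paper.

```python
import numpy as np

def build_gaussian_inputs(embeddings, region_features, parent_idx):
    """Concatenate each embedding e_i with the region feature of its parent p_i.

    embeddings:      (N, D_e) one learned embedding per Gaussian
    region_features: (R, D_r) one feature vector per parent region
    parent_idx:      (N,)     parent index p_i of each Gaussian (zero-based here)
    """
    embeddings = np.asarray(embeddings)
    region_features = np.asarray(region_features)
    parent_idx = np.asarray(parent_idx)

    gathered = region_features[parent_idx]            # (N, D_r): r_{p_i} per Gaussian
    return np.concatenate([embeddings, gathered], axis=-1)

# Three Gaussians; the first two share a parent, as in the illustrated case.
embeddings = np.arange(6, dtype=float).reshape(3, 2)
region_features = np.array([[10.0, 11.0, 12.0, 13.0],
                            [20.0, 21.0, 22.0, 23.0]])
parent_idx = np.array([0, 0, 1])

inputs = build_gaussian_inputs(embeddings, region_features, parent_idx)
```

In this toy example the first two rows of `inputs` end with the same region features but start with different embeddings, so Gaussians that share a parent can still be told apart.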