
Monocular and Generalizable Gaussian Talking Head Animation

Shengjie Gong, Haojie Li, Jiapeng Tang, Dongming Hu, Shuangping Huang, Hao Chen, Tianshui Chen, Zhuoman Liu

TL;DR

This work introduces Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which trains only on monocular datasets, generalizes to unseen identities without personalized re-training, and surpasses previous state-of-the-art methods.

Abstract

In this work, we introduce Monocular and Generalizable Gaussian Talking Head Animation (MGGTalk), which requires only monocular datasets and generalizes to unseen identities without personalized re-training. Compared with previous 3D Gaussian Splatting (3DGS) methods that require elusive multi-view datasets or tedious personalized learning/inference, MGGTalk enables more practical and broader applications. However, in the absence of multi-view and personalized training data, the incompleteness of geometric and appearance information poses a significant challenge. To address this challenge, MGGTalk exploits depth information to enhance geometric features and facial symmetry characteristics to supplement both geometric and appearance features. First, based on the pixel-wise geometric information obtained from depth estimation, we incorporate symmetry operations and point cloud filtering techniques to ensure a complete and precise position parameter for 3DGS. Then, we adopt a two-stage strategy with symmetric priors for predicting the remaining 3DGS parameters: we begin by predicting Gaussian parameters for the visible facial regions of the source image, and these parameters are subsequently used to improve the prediction of Gaussian parameters for the non-visible regions. Extensive experiments demonstrate that MGGTalk surpasses previous state-of-the-art methods, achieving superior performance across various metrics.
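
The position-completion idea in the abstract (lifting per-pixel depth to 3D points, then using facial symmetry to fill in unseen regions) can be illustrated with a minimal sketch. This is not the authors' implementation: the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the choice of the plane `x = plane_x` as the symmetry plane, and all function names below are assumptions made for illustration, and the point cloud filtering step is omitted.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(depth: List[List[float]],
                    fx: float, fy: float,
                    cx: float, cy: float) -> List[Point]:
    """Lift each pixel (u, v) with positive depth z to a camera-space point.

    Assumes a simple pinhole camera; the paper's actual camera model is
    not stated in the abstract.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as "no measurement"
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def mirror_points(points: List[Point], plane_x: float = 0.0) -> List[Point]:
    """Reflect points across the plane x = plane_x (assumed symmetry plane)."""
    return [(2.0 * plane_x - x, y, z) for x, y, z in points]


def complete_with_symmetry(points: List[Point],
                           plane_x: float = 0.0) -> List[Point]:
    """Concatenate visible points with their mirrored counterparts."""
    return points + mirror_points(points, plane_x)


if __name__ == "__main__":
    # Tiny 2x2 depth map; one pixel has no valid depth.
    depth_map = [[1.0, 0.0],
                 [2.0, 2.0]]
    visible = unproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    full = complete_with_symmetry(visible)
    print(len(visible), len(full))
```

In this sketch the mirrored copy simply doubles the point set; a real system would also need to estimate the symmetry plane from the head pose and discard mirrored points that are redundant or implausible.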

Paper Structure

This paper contains 35 sections, 10 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: MGGTalk is trained using only monocular datasets and generalizes to unseen identities without personalized re-training. Additionally, it supports real-time generation of talking heads in diverse poses and novel viewpoints.
  • Figure 2: Comparison of 3DGS-based talking head animation methods, which typically rely on elusive multi-view datasets or tedious personalized learning. Our method achieves generalization to unseen identities while only training on monocular datasets.
  • Figure 3: Pipeline overview of MGGTalk. Given a source image $\mathbf{I}_s$, we first use semantic parsing to extract the head region $\mathbf{I}_s^{h}$ and torso-background $\mathbf{I}_s^{bg}$. The DGSR module generates point clouds $[\mathbf{P}_f; \mathbf{P}_f^{s}]$ for visible and invisible regions from $\mathbf{I}_s^{h}$. Expression features from the driving image or audio are used by the Deformation Network to edit the point cloud, resulting in $[\mathbf{P}_{d}; \mathbf{P}_{d}^{s}]$. The SGP module then takes the identity encoding $\mathbf{F}$ from $\mathbf{I}_s^{h}$ and the deformed point cloud $[\mathbf{P}_{d}; \mathbf{P}_{d}^{s}]$ to predict the complete Gaussian parameters $\mathcal{G}_{den}$. Finally, $\mathcal{G}_{den}$ is rendered and composited with torso-background $\mathbf{I}_s^{bg}$ to obtain the target $\mathbf{I}_{tat}$.
  • Figure 4: Qualitative comparisons with previous video-driven methods on the HDTF [zhang2021flow] and NeRSemble-Mono [kirschstein2023NeRSemble] datasets. The first two rows show the cross-identity driving results on the HDTF dataset, while the third and fourth rows present the results on the NeRSemble-Mono dataset. The last row shows the results on in-the-wild data. To demonstrate the multi-view consistency of our generated results, the last three columns display the fixed viewpoints at $-30^\circ$, $0^\circ$ and $+30^\circ$.
  • Figure 5: Qualitative comparisons with previous audio-driven methods on the HDTF [zhang2021flow] dataset. The last three columns display the fixed viewpoints at $-30^\circ$, $0^\circ$ and $+30^\circ$.
  • ...and 8 more figures