DEGSTalk: Decomposed Per-Embedding Gaussian Fields for Hair-Preserving Talking Face Synthesis

Kaijun Deng, Dezhi Zheng, Jindong Xie, Jinbao Wang, Weicheng Xie, Linlin Shen, Siyang Song

TL;DR

DEGSTalk tackles hair-preserving talking-face synthesis by integrating Deformable Pre-Embedding Gaussian Fields within a 3D Gaussian Splatting framework and a Dynamic Hair-Preserving Rendering pipeline. It predicts per-Gaussian deformations from audio features and implicit 3DMM coefficients through a tri-plane hash encoder and an MLP, producing deformed Gaussian parameters for rendering. A hair-aware fusion strategy preserves long-hair dynamics while maintaining facial realism, trained via a three-stage optimization that jointly enforces geometric fidelity and perceptual quality. On six portrait videos, DEGSTalk achieves state-of-the-art fidelity and near real-time rendering, demonstrating strong performance and hair preservation, with some noisy primitives remaining as future work.

Abstract

Accurately synthesizing talking face videos and capturing fine facial features for individuals with long hair presents a significant challenge. To address this challenge, which existing methods handle poorly, we propose decomposed per-embedding Gaussian fields (DEGSTalk), a 3D Gaussian Splatting (3DGS)-based talking face synthesis method for generating realistic talking faces with long hair. DEGSTalk employs Deformable Pre-Embedding Gaussian Fields, which dynamically adjust pre-embedding Gaussian primitives using implicit expression coefficients. This enables precise capture of dynamic facial regions and subtle expressions. Additionally, we propose a Dynamic Hair-Preserving Portrait Rendering technique to enhance the realism of long hair motion in the synthesized videos. Results show that DEGSTalk achieves improved realism and synthesis quality compared to existing approaches, particularly in handling complex facial dynamics and hair preservation. Our code will be publicly available at https://github.com/CVI-SZU/DEGSTalk.
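
To make the deformation step more concrete, the following is a minimal sketch of how predicted offsets in position, scale, and rotation might be applied to a single canonical Gaussian primitive. It is not the authors' implementation. The names `Gaussian`, `quat_mul`, `normalize`, and `deform` are hypothetical, the offsets stand in for the output of the deformation network, and the combination rules (additive position and scale, quaternion composition for rotation) are assumptions, since the text above does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Hypothetical container for one 3D Gaussian primitive."""
    position: tuple  # (x, y, z)
    scale: tuple     # (sx, sy, sz)
    rotation: tuple  # unit quaternion (w, x, y, z)


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def normalize(q):
    """Rescale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def deform(g, d_pos, d_scale, d_rot):
    """Apply predicted offsets to one canonical Gaussian.

    Assumption: position and scale offsets are additive, and the
    rotation offset is a quaternion composed with the canonical one.
    """
    pos = tuple(p + d for p, d in zip(g.position, d_pos))
    scale = tuple(s + d for s, d in zip(g.scale, d_scale))
    rot = normalize(quat_mul(normalize(d_rot), g.rotation))
    return Gaussian(pos, scale, rot)


# Usage: shift a canonical Gaussian slightly along x, leave the rest unchanged.
canonical = Gaussian((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.0))
moved = deform(canonical, (0.1, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0))
```

In a full system, the offsets would be predicted per Gaussian and per frame from the audio feature and expression coefficients, and the deformed primitives would then be passed to the 3DGS rasterizer.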

Paper Structure (12 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of DEGSTalk. Given a cropped reference video of a talking face and its corresponding speech, DEGSTalk first extracts the audio feature $f_a$ and performs 3D face reconstruction [retsinas20243d] to obtain 3DMM coefficients, including identity, shape, expression, eyelid movement, and jaw. Second, Gaussian primitives are pre-embedded to construct the deformable pre-embedding Gaussian fields, and the coarse static fields of the face and mouth are optimized from random point clouds. These deformable Gaussian fields predict transformations in position, scale, and rotation. Then, the 3DGS rasterizer deforms the pre-embedding Gaussian primitives and renders them into 2D images of the face and mouth from the camera perspective. Finally, these regions are fused by the dynamic hair-preserving portrait rendering to synthesize the final talking face video.
  • Figure 2: Visual results of the comparative experiments. We show the generated results of the baselines under the head reconstruction setting and the ground truth.
  • Figure 3: Visual results of the ablation study.
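
The final fusion step in the Figure 1 caption can be illustrated with a small compositing sketch. The caption does not give the fusion rule, so what follows is only an assumption: standard "over" alpha compositing, with the rendered face layered over the rendered mouth, and both layered over a background that is assumed to carry the hair and the rest of the portrait. The function names `over`, `fuse_pixel`, and `fuse_image` are hypothetical and do not come from the DEGSTalk code.

```python
def over(fg, fg_alpha, bg):
    """Standard 'over' compositing of one RGB pixel onto another."""
    return tuple(fg_alpha * f + (1.0 - fg_alpha) * b for f, b in zip(fg, bg))


def fuse_pixel(face, face_alpha, mouth, mouth_alpha, background):
    """Layer order (assumed): face over mouth over background."""
    inner = over(mouth, mouth_alpha, background)
    return over(face, face_alpha, inner)


def fuse_image(face, face_a, mouth, mouth_a, background):
    """Fuse full images stored as nested lists indexed [row][column]."""
    height, width = len(face), len(face[0])
    return [
        [
            fuse_pixel(face[y][x], face_a[y][x],
                       mouth[y][x], mouth_a[y][x],
                       background[y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Usage: a 1x2 image where the face covers the first pixel only.
face = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
face_a = [[1.0, 0.0]]
mouth = [[(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]]
mouth_a = [[0.0, 0.0]]
background = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
fused = fuse_image(face, face_a, mouth, mouth_a, background)
```

Where the face and mouth opacities are both zero, the background pixel passes through unchanged, which is one simple way hair pixels outside the rendered face could be preserved.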