EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation

Tianheng Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng

TL;DR

EGSTalker tackles real-time, high-fidelity audio-driven talking head generation with limited training data by introducing a two-stage Gaussian-based framework. A spatial structure encoder (multi-resolution hash triplane + KAN) builds a static 3D Gaussian head representation, which is then deformed by an Efficient Gaussian Deformation decoder that fuses audio with spatial cues via ESAA and periodic encoding. The approach achieves lip-sync accuracy and rendering quality comparable to state-of-the-art NeRF/3DGS methods while delivering substantially higher inference speed, enabling real-time applications. Ablation studies confirm the importance of KAN, ESAA, PPE, and static Gaussian initialization for both reconstruction quality and temporal synchronization.

Abstract

This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to enhance both speed and visual fidelity, EGSTalker requires only 3-5 minutes of training video to synthesize high-quality facial animations. The framework comprises two key stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) are used to extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module to fuse audio and spatial cues, while KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods, while significantly outperforming them in inference speed. These results highlight EGSTalker's potential for real-time multimedia applications.
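The two-stage idea described above (a static set of Gaussians whose attributes are shifted by audio-conditioned offsets) can be illustrated with a minimal sketch. This is not the authors' implementation: plain scaled dot-product cross-attention stands in for ESAA, a single linear map stands in for the KAN, only position offsets are predicted, and all names, shapes, and weights (`deform_gaussians`, `w_q`, `w_out`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def deform_gaussians(positions, spatial_feat, audio_feat, w_q, w_k, w_v, w_out):
    """Shift static Gaussian centers by audio-conditioned offsets.

    positions:    (N, 3) static Gaussian centers (stage-one output, assumed given)
    spatial_feat: (N, d_s) per-Gaussian spatial features
    audio_feat:   (T, d_a) audio features for a short window of T frames
    """
    q = spatial_feat @ w_q                    # (N, d) queries from spatial features
    k = audio_feat @ w_k                      # (T, d) keys from audio
    v = audio_feat @ w_v                      # (T, d) values from audio
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    fused = np.concatenate([spatial_feat, attn @ v], axis=1)  # (N, d_s + d)
    offsets = fused @ w_out                   # (N, 3); linear stand-in for the KAN
    return positions + offsets, attn

# Toy inputs with made-up sizes.
rng = np.random.default_rng(0)
N, T, d_s, d_a, d = 5, 4, 8, 6, 8
positions = rng.normal(size=(N, 3))
spatial_feat = rng.normal(size=(N, d_s))
audio_feat = rng.normal(size=(T, d_a))
w_q = rng.normal(size=(d_s, d)) * 0.1
w_k = rng.normal(size=(d_a, d)) * 0.1
w_v = rng.normal(size=(d_a, d)) * 0.1

# A zero output layer predicts zero offsets, so the static head is returned unchanged.
w_out_zero = np.zeros((d_s + d, 3))
static_out, attn = deform_gaussians(
    positions, spatial_feat, audio_feat, w_q, w_k, w_v, w_out_zero
)

# A non-zero output layer moves each Gaussian center.
w_out = rng.normal(size=(d_s + d, 3)) * 0.1
deformed, _ = deform_gaussians(
    positions, spatial_feat, audio_feat, w_q, w_k, w_v, w_out
)
```

Predicting offsets rather than absolute attributes means the stage-one reconstruction serves as the starting point: with zero offsets the renderer sees exactly the static head, and the second stage only has to model motion.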

Paper Structure

This paper contains 21 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: EGSTalker accelerates talking head synthesis by combining a structured head representation with an optimized spatial-audio attention mechanism, enabling efficient 3D Gaussian deformation with improved clarity and motion fidelity.
  • Figure 2: Overview of the EGSTalker framework. The model employs a two-stage training strategy: the first stage constructs static 3D Gaussian representations, while the second stage uses an audio-guided deformation decoder to predict dynamic facial motions.
  • Figure 3: Overview of the Efficient Gaussian Deformation Decoder (a) and the ESAA module (b). The decoder predicts Gaussian attribute offsets for audio-driven deformation, with ESAA enabling spatial-audio interaction and PPE encoding temporal information.
  • Figure 4: Qualitative results of the self-driven setting on the Obama and May datasets. Our method achieves competitive results in reconstruction quality and lip synchronization compared to the state-of-the-art 3DGS-based method, GaussianTalker, excelling in head pose and facial expression control.
  • Figure 5: Effect of Static Gaussian Initialization on 3D point distribution. Without initialization, key facial regions (e.g., lips, eyes) exhibit sparse point density, degrading expression modeling. Initialization yields denser, more structured distributions, enhancing reconstruction and dynamic fidelity.
  • ...and 1 more figure