EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation
Tianheng Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng
TL;DR
EGSTalker tackles real-time, high-fidelity audio-driven talking head generation from limited training data with a two-stage framework based on 3D Gaussian Splatting (3DGS). A spatial structure encoder (a multi-resolution hash triplane plus a Kolmogorov-Arnold Network, KAN) builds a static 3D Gaussian head representation. An Efficient Gaussian Deformation decoder then deforms this representation by fusing audio with spatial cues through Efficient Spatial-Audio Attention (ESAA) and periodic positional encoding (PPE). The approach achieves lip-sync accuracy and rendering quality comparable to state-of-the-art NeRF and 3DGS methods at substantially higher inference speed, enabling real-time applications. Ablation studies indicate that KAN, ESAA, PPE, and static Gaussian initialization each matter for both reconstruction quality and temporal synchronization.
Abstract
This paper presents EGSTalker, a real-time audio-driven talking head generation framework based on 3D Gaussian Splatting (3DGS). Designed to improve both speed and visual fidelity, EGSTalker needs only 3 to 5 minutes of training video to synthesize high-quality facial animations. The framework comprises two stages: static Gaussian initialization and audio-driven deformation. In the first stage, a multi-resolution hash triplane and a Kolmogorov-Arnold Network (KAN) extract spatial features and construct a compact 3D Gaussian representation. In the second stage, we propose an Efficient Spatial-Audio Attention (ESAA) module that fuses audio and spatial cues, while a KAN predicts the corresponding Gaussian deformations. Extensive experiments demonstrate that EGSTalker achieves rendering quality and lip-sync accuracy comparable to state-of-the-art methods while significantly outperforming them in inference speed. These results highlight EGSTalker's potential for real-time multimedia applications.
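
The second stage described above can be pictured with a rough sketch. This is not the authors' implementation: plain scaled dot-product attention stands in for ESAA, the hypothetical `toy_offset_head` stands in for the KAN deformation predictor, and all function names and feature sizes are assumptions made for illustration. The hash triplane encoder is omitted, and only Gaussian centres are deformed here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def toy_offset_head(x):
    """Hypothetical stand-in for the KAN head: maps a feature list to a 3D offset."""
    return [0.01 * sum(x[i::3]) for i in range(3)]

def deform_gaussians(static_positions, spatial_feats, audio_tokens, predict_offset):
    """Add an audio-conditioned offset to each static Gaussian centre."""
    deformed = []
    for pos, feat in zip(static_positions, spatial_feats):
        # Spatial feature acts as the query; audio tokens act as keys and values.
        fused = attend(feat, audio_tokens, audio_tokens)
        # Concatenate spatial and fused audio features, then predict an offset.
        offset = predict_offset(feat + fused)
        deformed.append([p + o for p, o in zip(pos, offset)])
    return deformed

# Two static Gaussian centres with 4-dimensional spatial features,
# driven by two 4-dimensional audio tokens for the current frame.
static = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
feats = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
audio = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
frame = deform_gaussians(static, feats, audio, toy_offset_head)
```

The point of the sketch is the split between the stages: the static centres are computed once, and each frame only evaluates the fusion and offset prediction, which is what makes a per-frame deformation cheap relative to re-estimating the whole representation.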
