
GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting

Bo Chen, Shoukang Hu, Qi Chen, Chenpeng Du, Ran Yi, Yanmin Qian, Xie Chen

TL;DR

GSTalker tackles the challenge of real-time, high-fidelity audio-driven talking-face generation by adopting a deformable 3D Gaussian Splatting approach. A static Gaussian head is initialized from talking-face images and deformed by an audio-conditioned field (with a tri-plane hash encoding and temporal smoothing) to synchronize lips with speech, while a pose-conditioned deformation stabilizes the torso; real-time rendering is achieved via differentiable Gaussian splatting. The method reports fast training (~40 minutes) and real-time inference (~125 FPS), outperforming 2D and NeRF-based baselines in both speed and visual quality, and shows strong lip-sync and generalization in cross-speaker tests. These results suggest GSTalker enables practical, scalable digital humans for applications such as video conferencing, virtual dubbing, and AR/VR, with significantly reduced computational requirements compared to prior NeRF-based frameworks.

Abstract

We present GSTalker, a 3D audio-driven talking face generation model based on Gaussian Splatting that achieves both fast training (40 minutes) and real-time rendering (125 FPS) from a 3$\sim$5 minute training video, whereas previous 2D and 3D NeRF-based modeling frameworks require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Gaussian deformation field that translates and transforms 3D Gaussians to synchronize them with the audio, incorporating a multi-resolution hash-grid-based tri-plane and a temporal smoothing module to learn accurate deformation for fine-grained facial details. In addition, a pose-conditioned deformation field is designed to model the stabilized torso. To enable efficient optimization of the conditioned Gaussian deformation field, we initialize the 3D Gaussians by learning a coarse static Gaussian representation. Extensive experiments on person-specific videos with audio tracks validate that GSTalker generates high-fidelity, lip-synchronized results with fast training and real-time rendering.

Paper Structure (21 sections, 8 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Overview of GSTalker. We model the whole talking face with 3D Gaussians and deformation fields. First, a static initialization stage optimizes a coarse static Gaussian representation of the face from a random point cloud. Then an audio-conditioned deformation field fuses spatial features from a multi-resolution tri-plane hash grid with audio features from an audio encoder to predict the position and shape changes of the 3D Gaussians. Given a camera pose, the deformed Gaussians are rendered in real time by the differentiable rasterizer. For the torso, a similar pose-conditioned deformation field drives the stabilizing motion.
  • Figure 2: Qualitative comparison with 2D-based and 3D NeRF-based methods. We present the images generated by our method and all the baselines on the AD-NeRF (Guo et al., 2021) dataset.
  • Figure 3: Qualitative comparison with real-time NeRF-based methods. We show additional generated results alongside those of the real-time NeRF-based methods, covering speakers of different genders and languages in the self-driven setting.
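
The deformation step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the feature sizes, the single dense layer, and the sinusoidal stand-in for the tri-plane hash grid are all assumptions made for illustration. It also assumes that "shape changes" means offsets to a rotation quaternion and to a log-space scale, as is common in Gaussian Splatting code.

```python
import math
import random


def triplane_features(position):
    """Hypothetical stand-in for the tri-plane hash grid.

    Projects the point onto the xy, yz and xz planes and returns two
    values per plane (6 in total). A real hash grid would look up
    learned features instead of using sin/cos.
    """
    x, y, z = position
    feats = []
    for u, v in ((x, y), (y, z), (x, z)):
        feats += [math.sin(u), math.cos(v)]
    return feats


def make_layer(n_in, n_out, rng, std=0.01):
    """Create one dense layer as (weights, bias) with small random weights."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b, where weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def deform_gaussian(position, rotation, log_scale, audio_feat, layer):
    """Deform one Gaussian from fused spatial and audio features.

    The layer outputs 10 numbers: 3 position offsets, 4 quaternion
    offsets and 3 log-scale offsets (an assumed parameterization).
    """
    fused = triplane_features(position) + list(audio_feat)  # concatenation
    out = linear(fused, *layer)
    d_pos, d_rot, d_scale = out[0:3], out[3:7], out[7:10]

    new_pos = [p + d for p, d in zip(position, d_pos)]
    new_rot = [r + d for r, d in zip(rotation, d_rot)]
    norm = math.sqrt(sum(r * r for r in new_rot))
    new_rot = [r / norm for r in new_rot]  # keep the quaternion unit length
    new_scale = [s + d for s, d in zip(log_scale, d_scale)]
    return new_pos, new_rot, new_scale


if __name__ == "__main__":
    rng = random.Random(0)
    audio = [0.5, -0.1, 0.3, 0.2]            # hypothetical 4-D audio feature
    layer = make_layer(6 + len(audio), 10, rng)
    print(deform_gaussian([0.1, 0.2, 0.3], [1.0, 0.0, 0.0, 0.0],
                          [-2.0, -2.0, -2.0], audio, layer))
```

In this sketch the static Gaussian parameters play the role of the coarse initialization, and only the offsets depend on the audio, so a layer that outputs zeros leaves the Gaussian unchanged.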