GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

Kyusun Cho, Joungbin Lee, Heeji Yoon, Yeobin Hong, Jaehoon Ko, Sangjun Ahn, Seungryong Kim

TL;DR

GaussianTalker presents a real-time, pose-controllable talking head framework that leverages 3D Gaussian Splatting to render dynamic heads. It learns a canonical 3DGS head through a multi-resolution triplane and predicts per-frame Gaussian deformations via a spatial-audio cross-attention module, enabling stable and accurate lip synchronization. The approach uses stage-wise training with a canonical stage and a deformation stage, along with eye-blink, viewpoint, and null-vector cues to disentangle audio-driven motion from non-audio scene changes, achieving up to 120 FPS and improved fidelity over NeRF-based baselines. This work advances real-time neural rendering of talking heads with detailed facial motion and hair, suitable for digital humans, avatars, and teleconferencing, while releasing code for reproducibility.

Abstract

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed up to 120 FPS, surpassing previous benchmarks. Our code is made available at https://github.com/KU-CVLAB/GaussianTalker/ .
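
The spatial-audio attention step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (the released code is the reference). It assumes that per-Gaussian features act as queries and that four conditioning tokens (audio, eye blink, viewpoint, null) act as both keys and values. The names `cross_attend`, `predict_offsets`, and `W_pos` are hypothetical, learned query/key/value projections are omitted, and only a 3D position offset is predicted, whereas the paper predicts offsets for the attributes of each Gaussian.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(gaussian_feats, cond_tokens):
    """Scaled dot-product cross-attention.

    gaussian_feats: N vectors of size d (queries, one per Gaussian).
    cond_tokens:    T vectors of size d (keys and values; an assumption).
    Returns (attended features, attention weights), N x d and N x T.
    """
    d = len(cond_tokens[0])
    attended, weights = [], []
    for q in gaussian_feats:
        w = softmax([dot(q, k) / math.sqrt(d) for k in cond_tokens])
        out = [sum(w[t] * tok[j] for t, tok in enumerate(cond_tokens))
               for j in range(d)]
        attended.append(out)
        weights.append(w)
    return attended, weights


def predict_offsets(gaussian_feats, cond_tokens, W_pos):
    """Fuse attended features with a residual add, then apply a linear head.

    W_pos is a hypothetical 3 x d matrix mapping a fused feature to a
    position offset (dx, dy, dz) for that Gaussian.
    """
    attended, weights = cross_attend(gaussian_feats, cond_tokens)
    fused = [[f + a for f, a in zip(feat, att)]
             for feat, att in zip(gaussian_feats, attended)]
    offsets = [[dot(row, h) for row in W_pos] for h in fused]
    return offsets, weights


random.seed(0)
d, n_gauss = 8, 5


def rand_vec(scale=1.0):
    return [random.gauss(0.0, scale) for _ in range(d)]


feats = [rand_vec() for _ in range(n_gauss)]   # stand-in for triplane features
tokens = [rand_vec() for _ in range(4)]        # audio, eye, viewpoint, null
W_pos = [rand_vec(0.01) for _ in range(3)]     # small init keeps offsets near zero

offsets, weights = predict_offsets(feats, tokens, W_pos)
print(len(offsets), len(offsets[0]), len(weights[0]))
```

Each row of `weights` is a distribution over the four conditioning tokens for one Gaussian, which is the kind of per-modality attention score that Figure 4 visualizes.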

Paper Structure

This paper contains 55 sections, 19 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Fidelity, lip synchronization, and inference time comparison between existing 3D talking face synthesis models (guo2021adnerf, tang2022radnerf, li2023ernerf) and ours. Our method, GaussianTalker, achieves superior performance at much higher FPS. Note that we also include GaussianTalker$^*$, a more efficient and faster variant. The size of each bubble represents the per-frame inference time of each method.
  • Figure 2: Overview of our GaussianTalker framework. GaussianTalker utilizes a multi-resolution triplane to leverage different scales of features depicting a canonical 3D head. These features are fed into a spatial-audio attention module along with the audio feature to predict per-frame deformations, enabling fast and reliable talking head synthesis.
  • Figure 3: Visualization of the triplane feature grids. The sequence displays a rendered image, followed by its orthographically projected embeddings: frontal (xy), overhead (yz), and side (zx) views. The embeddings are visualized by reducing their dimensionality to 3 using PCA.
  • Figure 4: Illustration of attention score distributions across different modalities for two individuals. From left to right: the original rendered image, attention scores responsible for audio cues, eye blink dynamics, head orientation (facial viewpoint), and temporal consistency (null), respectively.
  • Figure 5: Comparative visualization of lip synchronization across different audio-visual models. The sequence depicts the lip shape conforming to specific phonemes in the spoken words 'country', 'of', 'crime', 'we', 'up', 'especially', 'like', with the last frame showing a closed mouth ('mute').
  • ...and 6 more figures
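
The multi-resolution triplane lookup mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: points are assumed to lie in [-1, 1]^3, each plane is sampled with bilinear interpolation, and features from the three planes (xy, yz, zx) and from all resolution levels are concatenated. The function names and the concatenation choice are hypothetical.

```python
def bilinear(grid, u, v):
    """Bilinearly sample an R x R x C grid (nested lists, R >= 2) at (u, v) in [0, 1].

    u indexes the second (column) axis and v the first (row) axis.
    """
    R = len(grid)
    u = min(max(u, 0.0), 1.0)
    v = min(max(v, 0.0), 1.0)
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = min(int(x), R - 2), min(int(y), R - 2)
    tx, ty = x - x0, y - y0
    C = len(grid[0][0])
    return [
        (1 - tx) * (1 - ty) * grid[y0][x0][c]
        + tx * (1 - ty) * grid[y0][x0 + 1][c]
        + (1 - tx) * ty * grid[y0 + 1][x0][c]
        + tx * ty * grid[y0 + 1][x0 + 1][c]
        for c in range(C)
    ]


def triplane_feature(levels, point):
    """Concatenate plane features over all resolution levels.

    levels: list of dicts with keys 'xy', 'yz', 'zx', each an R x R x C grid.
    point:  (x, y, z) with each coordinate in [-1, 1] (an assumption).
    """
    x, y, z = [(c + 1.0) / 2.0 for c in point]  # map [-1, 1] to [0, 1]
    feat = []
    for planes in levels:
        feat += bilinear(planes["xy"], x, y)
        feat += bilinear(planes["yz"], y, z)
        feat += bilinear(planes["zx"], z, x)
    return feat


def make_ramp(R):
    """Toy 1-channel grid whose value equals the normalized column coordinate."""
    return [[[col / (R - 1)] for col in range(R)] for _ in range(R)]


# Two resolution levels (2 x 2 and 4 x 4), the same toy ramp on every plane.
levels = [{k: make_ramp(R) for k in ("xy", "yz", "zx")} for R in (2, 4)]
feature = triplane_feature(levels, (0.0, 0.5, -1.0))
print(len(feature))
```

With the toy ramp grids, each sampled value simply reproduces the first normalized coordinate passed to `bilinear`, which makes the interpolation easy to check by hand.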