GaussianSpeech: Audio-Driven Gaussian Avatars

Shivangi Aneja, Artem Sevastopolsky, Tobias Kirschstein, Justus Thies, Angela Dai, Matthias Nießner

TL;DR

GaussianSpeech addresses the challenge of producing photorealistic, 3D-consistent talking-head avatars driven by speech. It combines a compact 3D Gaussian Splatting avatar with expression- and view-dependent color, wrinkle-aware perceptual losses, and a transformer-based sequence model that maps audio features to realistic mouth and wrinkle dynamics. A new large-scale multi-view audio-visual dataset supports training and evaluation, enabling real-time, free-viewpoint rendering with diverse expressions and speaker styles. The approach outperforms competitive baselines in lip synchronization and visual quality, and is strengthened by careful avatar initialization, wrinkle regularization, and targeted ablations that highlight the importance of alignment, audio fine-tuning, and color/latent refinements. Altogether, GaussianSpeech advances high-fidelity, audio-driven 3D avatar synthesis suitable for telepresence and digital storytelling.

Abstract

We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple the speech signal with 3D Gaussian splatting to create realistic, temporally coherent motion sequences. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details, including wrinkles that occur with different expressions. To enable sequence modeling of 3D Gaussian splats with audio, we devise an audio-conditioned transformer model capable of extracting lip and expression features directly from audio input. Due to the absence of high-quality datasets of talking humans in correspondence with audio, we captured a new large-scale multi-view dataset of audio-visual sequences of talking humans with native English accents and diverse facial geometry. GaussianSpeech consistently achieves state-of-the-art performance with visually natural motion at real-time rendering rates, while encompassing diverse facial expressions and styles.

Paper Structure

This paper contains 34 sections, 30 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Given an input speech signal, GaussianSpeech can synthesize photorealistic 3D-consistent talking human head avatars. Our method can generate realistic and high-quality animations, including mouth interiors such as teeth, wrinkles, and specularities in the eyes. We handle diverse facial geometry, including hair buns and mustaches/beards, while effectively synchronizing to the audio signal.
  • Figure 2: Random frames selected for each participant (top) from the dataset and the corresponding zoom-in on the mouth region (bottom). To maximize diversity, we captured a gender-balanced dataset of native speakers with different English accents and diverse facial geometry, including different skin tones, beards, and glasses.
  • Figure 3: Person-specific 3D Avatar: We compute 3D face tracking and bind 3D Gaussians to the triangles of the tracked FLAME mesh. We apply volume-based pruning to prevent the optimization from generating a large number of Gaussians, and we subdivide the mesh triangles in the mouth region. We train a color MLP $\boldsymbol{\theta}_\textrm{color}$ to synthesize expression- and view-dependent color. We apply wrinkle regularization and perceptual losses to improve photorealism.
  • Figure 4: Method Overview. From the given speech signal, GaussianSpeech uses the Wav2Vec 2.0 encoder [baevski2020wav2vec] to extract generic audio features and maps them to personalized lip feature embeddings $\boldsymbol{c}^{1:T}$ with the Lip Transformer Encoder and to wrinkle features $\boldsymbol{w}^{1:T}$ with the Wrinkle Transformer Encoder. Next, the Expression Encoder synthesizes FLAME expressions $\boldsymbol{e}^{1:T}$, which are projected via the Expression2Latent MLP and concatenated with $\boldsymbol{c}^{1:T}$ as input to the motion decoder. The motion decoder is a multi-head transformer decoder [vaswani2023attention] consisting of Multihead Self-Attention, Cross-Attention, and Feed Forward layers. The concatenated lip-expression features are fused into the decoder via cross-attention layers with alignment mask $\mathcal{M}$. The decoder then predicts FLAME vertex offsets $\{ \boldsymbol{V}_\textrm{offset} \}^{1:T}$, which are added to the template mesh $\boldsymbol{T}$ to generate vertex animation in canonical space. During training, these are fed to our optimized 3DGS avatar (see the avatar initialization section), and the color MLP $\boldsymbol{\theta}_\textrm{color}$ and Gaussian latents $\boldsymbol{z}$ are further refined via re-rendering losses [kerbl3Dgaussians].
  • Figure 5: Avatar Reconstruction: GaussianAvatars [qian2023gaussianavatars] produces blurry results and cannot handle dynamic wrinkles. For our method, removing the perceptual loss prevents it from synthesizing sharp textures, both globally and in local, less-observed regions such as the teeth; wrinkle regularization helps model dynamic wrinkles; subdividing the mouth faces improves the mouth interior; and the color MLP helps synthesize sharper colors and accurate dynamic wrinkles. Our full avatar initialization technique with all regularizations achieves the best results.
  • ...and 14 more figures
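
The Figure 4 caption describes two mechanical steps that a small example may make more concrete: cross-attention restricted by an alignment mask $\mathcal{M}$, and per-frame vertex offsets added to the template mesh $\boldsymbol{T}$. The sketch below is illustrative only and is not the authors' implementation. It assumes a purely diagonal mask (motion frame t attends only to conditioning frame t), which the caption does not specify, and the function names `alignment_mask`, `masked_cross_attention`, and `animate` are hypothetical.

```python
import math


def alignment_mask(num_frames):
    """Assumed diagonal mask: entry [t][s] is True only when s == t."""
    return [[s == t for s in range(num_frames)] for t in range(num_frames)]


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product attention where masked-out scores are set to -inf."""
    dim = len(queries[0])
    outputs = []
    for query, mask_row in zip(queries, mask):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if allowed else -math.inf
            for key, allowed in zip(keys, mask_row)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def animate(template, offsets):
    """Add each frame's vertex offsets to the template: V^t = T + V_offset^t."""
    return [
        [[tv + ov for tv, ov in zip(vertex, offset)]
         for vertex, offset in zip(template, frame)]
        for frame in offsets
    ]


# Toy data: 3 frames of 2-D features, and a 2-vertex template mesh.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [2.0], [3.0]]
attended = masked_cross_attention(features, features, values, alignment_mask(3))

template = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
offsets = [[[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]]  # a single frame of offsets
frames = animate(template, offsets)
```

With a strictly diagonal mask each output row reduces to the value of its own time step, so a practical mask would more likely allow a window of neighbouring frames; the diagonal case is used here only because it is the simplest to check by hand.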