EmbedTalk: Triplane-Free Talking Head Synthesis using Embedding-Driven Gaussian Deformation

Arpita Saggar, Jonathan C. Darling, Duygu Sarikaya, David C. Hogg

TL;DR

EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models.

Abstract

Real-time talking head synthesis increasingly relies on deformable 3D Gaussian Splatting (3DGS) due to its low latency. Tri-planes are the standard choice for encoding Gaussians prior to deformation, since they provide a continuous domain with explicit spatial relationships. However, tri-plane representations are limited by grid resolution and approximation errors introduced by projecting 3D volumetric fields onto 2D subspaces. Recent work has shown the superiority of learnt embeddings for driving temporal deformations in 4D scene reconstruction. We introduce EmbedTalk, which shows how such embeddings can be leveraged for modelling speech deformations in talking head synthesis. Through comprehensive experiments, we show that EmbedTalk outperforms existing 3DGS-based methods in rendering quality, lip synchronisation, and motion consistency, while remaining competitive with state-of-the-art generative models. Moreover, replacing the tri-plane encoding with learnt embeddings enables significantly more compact models that achieve over 60 FPS on a mobile GPU (RTX 2060 6 GB). Our code will be placed in the public domain on acceptance.
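
To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two encodings, not the authors' implementation. The grid resolution, feature width, concatenation of the three plane features, and the helper names `bilinear_sample`, `triplane_feature`, and `embedding_feature` are all assumptions chosen for illustration.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate an (R, R, C) feature plane at u, v in [-1, 1]."""
    res = plane.shape[0]
    # Map [-1, 1] to continuous grid coordinates in [0, R - 1].
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x0 + 1]
    bottom = (1 - tx) * plane[y0 + 1, x0] + tx * plane[y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom

def triplane_feature(planes, point):
    """Project a 3D point onto the xy, xz, and yz planes and sample each one.

    Concatenating the three samples is an assumption; other schemes sum or
    multiply them instead.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return np.concatenate([f_xy, f_xz, f_yz])

def embedding_feature(table, gaussian_index):
    """Learnt-embedding alternative: one free vector per Gaussian, no grid."""
    return table[gaussian_index]

rng = np.random.default_rng(0)
# Hypothetical sizes: 16 x 16 planes with 4 channels, 100 Gaussians.
planes = {name: rng.normal(size=(16, 16, 4)) for name in ("xy", "xz", "yz")}
table = rng.normal(size=(100, 12))

tri = triplane_feature(planes, np.array([0.1, -0.3, 0.7]))
emb = embedding_feature(table, 42)
```

In this sketch, two points that share the same (x, y) but differ in z read an identical feature from the xy plane, and every sample is smoothed by the grid resolution. The embedding lookup has neither constraint, since each Gaussian owns its vector directly.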

Paper Structure (13 sections, 8 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Nearly all prior work on 3DGS-based talking head synthesis uses tri-planes, where approximation errors can hamper audio-visual alignment. EmbedTalk replaces tri-planes with learnable per-Gaussian embeddings, resulting in more accurate mouth movements and reduced computational overhead.
  • Figure 2: EmbedTalk begins with a talking portrait video. The video frames enable a dense reconstruction of the head that initialises the 3D Gaussians. Each Gaussian is also associated with a learnable embedding $z_g$. For each frame, the corresponding speech signal $a$ and upper-face movements $e$ are fed into the deformation MLP, along with a positional encoding of $z_g$ to predict the Gaussian deformations ($\Delta\mu, \Delta\alpha$). The deformed Gaussians are passed to the rasteriser, along with the viewing direction (camera), to render the head onto the combined torso and scene background.
  • Figure 3: Qualitative comparison with recent 3DGS-based works. EmbedTalk reconstructs narrow mouth openings more faithfully than other methods.
  • Figure 4: Qualitative comparison with generative methods. Despite accurate lip-sync, generative models produce exaggerated movements that reduce realism.
  • Figure 5: Differences between consecutive frames accumulated over a 20-frame interval. The white pixels in the upper head region indicate the presence of temporal flickering.
  • ...and 1 more figure
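
The deformation step described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's architecture: the two-layer MLP, the sinusoidal encoding with four frequencies, all dimensions, and the reading of $\Delta\mu$ as a 3D position offset and $\Delta\alpha$ as a scalar offset are hypothetical choices, as are the names `positional_encoding`, `init_mlp`, and `deform`.

```python
import numpy as np

def positional_encoding(z, num_freqs=4):
    """Sinusoidal encoding of per-Gaussian embeddings z with shape (N, D)."""
    freqs = 2.0 ** np.arange(num_freqs)                  # (F,)
    scaled = z[..., None] * freqs                        # (N, D, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)  # (N, D, 2F)
    return np.concatenate([z, enc.reshape(z.shape[0], -1)], axis=-1)

def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Randomly initialised two-layer MLP (sizes are assumptions)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def deform(params, z_g, a, e):
    """Predict per-Gaussian offsets from embeddings z_g, audio a, upper-face e."""
    n = z_g.shape[0]
    enc = positional_encoding(z_g)
    # The same per-frame conditioning vector is shared by every Gaussian.
    cond = np.broadcast_to(np.concatenate([a, e]), (n, a.size + e.size))
    x = np.concatenate([enc, cond], axis=-1)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    out = h @ params["w2"] + params["b2"]
    return out[:, :3], out[:, 3:4]                        # delta_mu, delta_alpha

rng = np.random.default_rng(0)
# Hypothetical sizes for one frame.
num_gaussians, embed_dim, audio_dim, expr_dim = 5, 8, 16, 6
z_g = rng.normal(size=(num_gaussians, embed_dim))
a = rng.normal(size=audio_dim)
e = rng.normal(size=expr_dim)

in_dim = embed_dim * (1 + 2 * 4) + audio_dim + expr_dim
params = init_mlp(rng, in_dim, 32, 4)
delta_mu, delta_alpha = deform(params, z_g, a, e)
```

The offsets would then be added to each Gaussian's parameters before rasterisation. Because the only per-Gaussian input is $z_g$, no grid lookup is involved at any point in this sketch.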