3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

Zhongju Wang, Zhenhong Sun, Beier Wang, Yifu Wang, Daoyi Dong, Huadong Mo, Hongdong Li

TL;DR

3DXTalker tackles the expressivity gap in audio-driven 3D avatars by jointly modeling identity, lip synchronization, emotion, and spatial dynamics. It introduces a data-curated 2D-to-3D identity pipeline using EMOCA and FLAME, enhances audio representations with frame-wise amplitude and emotion cues, and unifies these signals with a flow-matching transformer. Inference-time controllability via global emotion templates and head-pose trajectories enables flexible, cinematography-inspired styling while preserving identity and motion coherence. Extensive experiments show strong 3D geometry accuracy, credible emotional expressivity, and real-time performance, highlighting the approach's potential for scalable, expressive digital humans.

Abstract

Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics; together, these requirements define a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, a framework for expressive 3D talking avatars built on data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. 3DXTalker enables scalable identity modeling via a 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. We then introduce frame-wise amplitude and emotional cues beyond standard speech embeddings to improve lip synchronization and enable nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker generates natural head-pose motion while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework and achieves superior performance in 3D talking avatar generation.
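The abstract mentions frame-wise amplitude as an audio cue alongside speech embeddings. As a rough illustration of what such a cue could look like, the sketch below computes one root-mean-square (RMS) value per video frame. The choice of RMS, the 25 fps frame rate, the 16 kHz sample rate, and the function name `framewise_rms` are all assumptions made for this example; the abstract does not state how the authors compute or align the amplitude signal.

```python
import math

def framewise_rms(audio, sample_rate, fps=25):
    """Return one RMS amplitude per video frame (hypothetical helper).

    Assumes non-overlapping windows of sample_rate // fps samples;
    a trailing partial window is dropped.
    """
    hop = sample_rate // fps          # audio samples covered by one video frame
    n_frames = len(audio) // hop
    amplitudes = []
    for i in range(n_frames):
        chunk = audio[i * hop:(i + 1) * hop]
        amplitudes.append(math.sqrt(sum(x * x for x in chunk) / hop))
    return amplitudes

# Toy input: half a second of silence followed by a 220 Hz tone of peak 0.5.
sr = 16000
signal = [
    0.0 if i / sr < 0.5 else 0.5 * math.sin(2 * math.pi * 220 * i / sr)
    for i in range(sr)
]
amp = framewise_rms(signal, sr, fps=25)
```

In this toy signal the first frames have zero amplitude and the later frames sit near 0.5/√2, which is the kind of per-frame loudness curve that could plausibly be paired with mouth aperture.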

Paper Structure (26 sections, 11 equations, 17 figures, 6 tables)

This paper contains 26 sections, 11 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Overview of our expressive 3D talking avatar generation. Given a single reference image and a driving speech audio, 3DXTalker produces identity-consistent 3D talking avatars with accurate lip synchronization, expressive emotions, and natural head-pose dynamics. The project homepage is available at https://engineeringai-lab.github.io/3DXTalker.github.io/.
  • Figure 2: Overview of 3DXTalker framework. (a) A multi-branch flow-matching transformer fuses identity and audio cues to model disentangled FLAME parameter space. (b) Frame-wise audio amplitude contributes to coherent mouth aperture and head dynamics. (c) Frame-wise emotion embeddings help modulate emotional expressions.
  • Figure 3: Qualitative comparisons over selected typical baselines. (a) shows the consistency between generated meshes and the reference image. (b) shows better mouth aperture alignment. (c) shows finer emotional expressiveness. (d) shows predicted natural head pose and camera movements. Full baseline comparisons are provided in Appendix \ref{append:baseline}. Other emotion comparisons are offered in Appendix \ref{app:more_emo_compare}.
  • Figure 4: Visualizations of ablation results from Table \ref{tab:ablation_qualitative}. (a) is conducted on the same audio. (b) extracts each emotion from corresponding videos at the same frame. Details in Appendix \ref{append:amplitude}.
  • Figure 5: Our 3DXTalker supports two head-pose modes: (a) natural micro-movements learned from in-the-wild data, and (b) controllable head dynamics (with natural micro-movements) guided by a center motion trajectory. Trajectory colors indicate temporal progression (dark$\rightarrow$light). See Appendix \ref{append:pose} for more examples.
  • ...and 12 more figures
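
The Figure 2 caption describes a flow-matching transformer that models FLAME parameters. For readers unfamiliar with flow matching, the following is a minimal sketch of the generic idea using a straight-line path between a noise sample and a data sample. It is not the paper's formulation: the linear path, the mean-squared loss, the Euler sampler, and every function name here are assumptions for illustration, and the "model" is replaced by an oracle velocity so the example runs without any learned network.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(pred_v, x0, x1):
    """Mean squared error between a predicted and the target velocity."""
    tgt = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred_v, tgt)) / len(tgt)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Placeholder vectors standing in for a noise sample and a motion-parameter target.
x0 = [0.0, 1.0, -1.0]
x1 = [0.5, -0.2, 0.3]

# Oracle velocity field (stands in for a trained, condition-aware network).
oracle = lambda x, t: target_velocity(x0, x1)
generated = euler_sample(oracle, x0, steps=10)
```

With a perfect velocity field the sampler lands on the data point and the loss is zero. In a real system the oracle would be replaced by a network conditioned on identity and audio features, trained by minimizing this kind of loss at randomly drawn times `t`.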