Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture

Xuanchen Li, Jianyu Wang, Yuhao Cheng, Yikun Zeng, Xingyu Ren, Wenhan Zhu, Weiming Zhao, Yichao Yan

TL;DR

This work addresses the realism gap in audio-driven 3D talking heads by incorporating dynamic textures into the synthesis pipeline. It introduces TexTalker, a diffusion-based framework that jointly generates geometry and 8K dynamic textures from speech by learning motion and wrinkle animation primitives and aligning them through latent diffusion. A pivot-based style injection enables disentangled control over speaking and wrinkling styles, supporting highly personalized avatars. The work also contributes TexTalk4D, a large-scale 4D dataset, and reports superior geometry quality and texture realism; a public release of code and data is planned.

Abstract

Significant progress has been made in speech-driven 3D face animation, but most works focus on learning the motion of the mesh/geometry and ignore the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset, TexTalk4D, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework, TexTalker, to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complexity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms prior art in synthesizing facial motions, but also produces realistic textures that are consistent with the underlying facial movements. Project page: https://xuanchenli.github.io/TexTalk/.

Paper Structure

This paper contains 31 sections, 6 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: We present TexTalk4D, a high-precision 4D audio-mesh-texture-aligned dataset consisting of 100 minutes of scan-level meshes with detailed 8K textures. Based on the dataset, we present TexTalker to generate geometry and aligned dynamic textures from speech simultaneously, advancing towards highly personalized textured facial animation.
  • Figure 2: Data processing pipeline. We employ Topo4D [li2024topo4d] to obtain consistent meshes from LightStage captures. Based on the mesh sequence, we map the color from 24-view 4K images to get the 8K textures. Mesh alignment and texture blending are then conducted to get the final assets.
  • Figure 3: The overview of TexTalker. (a) We train quantized autoencoders to unify the representation of geometry and texture with better efficiency. (b) Based on the learned low-dimensional animation primitives, we employ an LDM to jointly diffuse geometry and texture latent offsets $\Delta \mathbf{z}$ from the style pivots $\mathbf{p}$ for long-term correlation learning. (c) By adding back the style pivots, the motion and wrinkle styles can be independently controlled. Finally, the personalized textured animation assets can be obtained by decoders.
  • Figure 4: t-SNE distribution of latent features from 20 subjects. The learned animation primitive spaces effectively distinguish latent features of different styles both in wrinkles and motions.
  • Figure 5: Visual comparison of generated motion. The upper partition shows samples conditioned by different phonemes and the syllables are highlighted in orange. The lower partition depicts the temporal motion statistics of the whole sequence, where the brighter the color, the more motion is observed.
  • ...and 6 more figures
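
The pivot-and-offset idea in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes, purely for illustration, that a style pivot $\mathbf{p}$ is the temporal mean of a sequence's latent codes, and it takes the offsets $\Delta \mathbf{z}$ directly from random stand-in data rather than from a diffusion model. The function names `split_into_pivot_and_offsets` and `inject_style` are hypothetical.

```python
import numpy as np


def split_into_pivot_and_offsets(latents):
    """Split a (T, D) latent sequence into a pivot p and per-frame offsets dz.

    Assumption (not from the paper): the pivot is the temporal mean.
    """
    pivot = latents.mean(axis=0)
    offsets = latents - pivot
    return pivot, offsets


def inject_style(offsets, pivot):
    """Add a style pivot back onto offsets: z = p + dz."""
    return offsets + pivot


# Random stand-ins for encoder outputs (hypothetical data, not real latents).
rng = np.random.default_rng(0)
T, D = 8, 4
geo_a = rng.normal(loc=1.0, size=(T, D))   # subject A, geometry latents
tex_a = rng.normal(loc=-2.0, size=(T, D))  # subject A, texture latents
tex_b = rng.normal(loc=3.0, size=(T, D))   # subject B, texture latents

geo_pivot_a, geo_offsets = split_into_pivot_and_offsets(geo_a)
tex_pivot_a, tex_offsets = split_into_pivot_and_offsets(tex_a)
tex_pivot_b, _ = split_into_pivot_and_offsets(tex_b)

# Keep subject A's motion style, but swap in subject B's wrinkle style.
geo_out = inject_style(geo_offsets, geo_pivot_a)
tex_out = inject_style(tex_offsets, tex_pivot_b)
```

Because the geometry and texture pivots are stored separately, either one can be replaced without touching the other, which is the sense in which the caption describes motion and wrinkle styles as independently controllable.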