SAiD: Speech-driven Blendshape Facial Animation with Diffusion

Inkyu Park, Jaewoong Cho

TL;DR

This work tackles the scarcity of large-scale visual-audio data for speech-driven 3D facial animation by introducing SAiD, a diffusion-based approach that generates diverse, lip-synced blendshape coefficients conditioned on speech, using a lightweight Transformer-based U-Net with an alignment bias. To support research in blendshape-based animation, the authors present BlendVOCA, a dataset built from VOCASET by transferring ARKit blendshapes and solving a constrained quadratic program (QP) to obtain smooth coefficient sequences. In the reported experiments, SAiD matches or exceeds the lip synchronization of baselines that operate on vertex meshes, while producing more diverse lip movements and supporting direct editing of the animation. The combination of diffusion modeling, alignment bias, and blendshape-centric evaluation provides a practical, editable pipeline for realistic speech-driven facial animation with accessible data resources. Together, the BlendVOCA dataset and the SAiD framework offer a pathway to more flexible, data-efficient animation in games, films, and AR/VR applications.
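
To make the term "blendshape coefficients" concrete, the sketch below evaluates a standard linear (delta) blendshape model, in which a frame's mesh is the neutral template plus a coefficient-weighted sum of per-blendshape offsets. It is a minimal illustration under that assumption, not code from the SAiD release; the function name `apply_blendshapes` and the toy two-vertex mesh are hypothetical.

```python
from typing import List, Sequence

Vertex = List[float]  # [x, y, z]


def apply_blendshapes(
    template: Sequence[Vertex],
    blendshapes: Sequence[Sequence[Vertex]],
    coeffs: Sequence[float],
) -> List[Vertex]:
    """Return template + sum_k coeffs[k] * (blendshapes[k] - template)."""
    if len(blendshapes) != len(coeffs):
        raise ValueError("need exactly one coefficient per blendshape")
    mesh = [list(v) for v in template]  # copy so the template is not mutated
    for shape, c in zip(blendshapes, coeffs):
        for i, (target, neutral) in enumerate(zip(shape, template)):
            for axis in range(3):
                mesh[i][axis] += c * (target[axis] - neutral[axis])
    return mesh


# Toy data (hypothetical): two vertices, two blendshapes.
template = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
jaw_open = [[0.0, -1.0, 0.0], [1.0, -1.0, 0.0]]  # moves both vertices down
smile = [[0.0, 0.0, 0.0], [1.5, 0.5, 0.0]]       # moves only the second vertex

# One animation frame: half-open jaw combined with a full smile.
frame = apply_blendshapes(template, [jaw_open, smile], [0.5, 1.0])
```

Because each frame is described by a short coefficient vector (32 values per frame in BlendVOCA) rather than by thousands of vertex positions, an animator can edit the output by adjusting individual coefficients.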

Abstract

Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets despite extensive research. Most prior works, typically focused on learning regression models on a small dataset using the method of least squares, encounter difficulties generating diverse lip movements from speech and require substantial effort in refining the generated outputs. To address these issues, we propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of pairs of speech audio and parameters of a blendshape facial model, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves comparable or superior performance in lip synchronization to baselines, ensures more diverse lip movements, and streamlines the animation editing process.
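
The cross-modality alignment bias can be pictured as an additive mask on the cross-attention logits that lets each animation frame attend only to the audio features covering the same time window. The sketch below is one plausible construction, not the authors' implementation: it assumes the audio features have been resampled to a fixed integer number per frame (`feats_per_frame`, a hypothetical parameter) and that positions outside the window are blocked entirely.

```python
import math
from typing import List


def alignment_bias(num_frames: int, feats_per_frame: int) -> List[List[float]]:
    """Additive bias of shape (num_frames, num_frames * feats_per_frame).

    Entry [i][j] is 0.0 when audio feature j lies in frame i's window and
    -inf otherwise, so a softmax over the biased logits ignores other windows.
    """
    if num_frames < 1 or feats_per_frame < 1:
        raise ValueError("num_frames and feats_per_frame must be positive")
    num_feats = num_frames * feats_per_frame
    bias = [[-math.inf] * num_feats for _ in range(num_frames)]
    for i in range(num_frames):
        for j in range(i * feats_per_frame, (i + 1) * feats_per_frame):
            bias[i][j] = 0.0
    return bias


def masked_softmax(logits: List[float], bias_row: List[float]) -> List[float]:
    """Softmax over logits + bias; blocked positions receive weight 0."""
    shifted = [l + b for l, b in zip(logits, bias_row)]
    peak = max(shifted)  # finite, because every row has an open window
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Three frames, two audio features per frame (toy sizes).
bias = alignment_bias(num_frames=3, feats_per_frame=2)

# Attention weights of frame 0 over all six audio features.
weights = masked_softmax([0.3, 0.3, 2.0, -1.0, 0.0, 5.0], bias[0])
```

In this toy case, frame 0 spreads its attention over the first two audio features only, even though later features have larger raw logits. A softer variant would widen the window or use finite penalties instead of `-inf`.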

Paper Structure

This paper contains 46 sections, 19 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Overview of SAiD. The conditional diffusion model generates a sequence of blendshape coefficients from Gaussian noise, conditioned on the speech waveform. The generated blendshape coefficients are then converted into a facial animation using the blendshape facial model.
  • Figure 2: BlendVOCA construction process. The process unfolds in two steps: 1) we transfer the deformations of the ARKit reference mesh to the 12 template meshes of VOCASET (Cudeiro et al., 2019) using the deformation transfer algorithm of Sumner and Popović (2004), which produces 32 blendshape meshes for each template mesh; 2) we then generate blendshape coefficients by solving the quadratic programming problem defined in the paper (eq:qp).
  • Figure 3: The model architecture of SAiD. For each diffusion timestep, SAiD predicts the noise injected into the input noisy blendshape coefficient sequence, conditioned on the speech waveform. The denoiser is a simplified conditional UNet1D model composed of one encoder block, one middle block, and one decoder block, without downsampling or upsampling layers. The diffusion timestep is converted into a sinusoidal embedding that is fed to each residual block in the denoiser. The speech waveform is converted into audio feature vectors by a frozen pre-trained Wav2Vec 2.0 model, and these features form the key and value matrices of the cross-attention layer in the denoiser. The alignment bias is applied as a memory mask in the cross-attention layer to enhance the alignment between the speech and the blendshape coefficient sequence. A trainable null condition embedding is used in place of the audio features to implement classifier-free guidance (or unconditional generation).
  • Figure 4: Motion editing. Hatched boxes indicate the masked areas that must remain unchanged during editing. SAiD generates motion in the unmasked areas using the motion editing procedure described in the paper (sec:method:model:edit). Video results of these editing tasks are provided on the project page.
  • Figure 5: Effect of the velocity loss. Blue lines indicate SAiD's inference results with velocity loss training, while orange lines display results without velocity loss. As highlighted in the red box, the blue lines demonstrate notably reduced jitter compared to the orange lines.
  • ...and 6 more figures
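
The velocity loss mentioned in the Figure 5 caption can be illustrated with a small sketch that compares frame-to-frame differences of the predicted and target coefficient sequences instead of the coefficients themselves. This is an assumed formulation for illustration only: the choice of mean absolute error and the helper names `velocity` and `velocity_loss` are hypothetical, and this summary does not state the exact norm or weighting used in SAiD.

```python
from typing import List, Sequence

Sequence2D = Sequence[Sequence[float]]  # (num_frames, num_blendshapes)


def velocity(seq: Sequence2D) -> List[List[float]]:
    """First-order temporal difference: seq[t + 1] - seq[t] per channel."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])]


def velocity_loss(pred: Sequence2D, target: Sequence2D) -> float:
    """Mean absolute error between predicted and target velocities."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length with >= 2 frames")
    diffs = [
        abs(p - t)
        for p_row, t_row in zip(velocity(pred), velocity(target))
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


# Toy example (hypothetical values): one blendshape channel, four frames.
target = [[0.0], [0.25], [0.5], [0.75]]   # smooth ramp
jittery = [[0.0], [0.5], [0.25], [0.75]]  # same endpoints, oscillating

smooth_loss = velocity_loss(target, target)
jitter_loss = velocity_loss(jittery, target)
```

Here the oscillating sequence is penalized even though it matches the target at the first and last frames, which is the behavior a term of this kind relies on to discourage jitter.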