SAiD: Speech-driven Blendshape Facial Animation with Diffusion
Inkyu Park, Jaewoong Cho
TL;DR
This work tackles the scarcity of large-scale visual-audio data for speech-driven 3D facial animation by introducing SAiD, a diffusion-based approach that generates diverse, lip-synced blendshape coefficients conditioned on speech, using a lightweight Transformer-based U-Net with an alignment bias. To support research in blendshape-based animation, the authors present BlendVOCA, a dataset built from VOCASET by transferring ARKit blendshapes and solving a constrained quadratic program (QP) to obtain smooth coefficient sequences. In the reported experiments, SAiD matches or exceeds baselines that operate on vertex meshes in lip synchronization, produces more diverse lip movements, and supports effective editing. Together, diffusion modeling, the alignment bias, and blendshape-centric evaluation yield a practical, editable pipeline for realistic speech-driven facial animation, and the publicly available BlendVOCA dataset offers a pathway to more flexible, data-efficient animation in games, films, and AR/VR applications.
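To make the "constrained QP" step more concrete, the following is a minimal sketch rather than the authors' formulation, which this summary does not spell out. It assumes a linear blendshape model (displacement from the neutral face equals a weighted sum of blendshape offsets), coefficients restricted to [0, 1], and a quadratic penalty tying each frame to the previous one. The function names, the `smooth` weight, and the projected-gradient solver are all illustrative assumptions.

```python
# Toy sketch: fit blendshape coefficients in [0, 1] to per-frame targets with a
# temporal-smoothness penalty, using projected gradient descent.
# The objective and every name here are assumptions made for illustration.

def fit_frame(basis, target, prev, smooth=0.1, steps=2000, lr=0.05):
    """Minimise ||basis @ w - target||^2 + smooth * ||w - prev||^2 with 0 <= w <= 1.

    basis:  list of rows (one per vertex coordinate), each with K entries
    target: displacements from the neutral face, same length as basis
    prev:   coefficients of the previous frame, length K
    """
    k = len(prev)
    w = list(prev)
    for _ in range(steps):
        # Residual of the linear blendshape model at the current estimate.
        resid = [sum(row[j] * w[j] for j in range(k)) - t
                 for row, t in zip(basis, target)]
        for j in range(k):
            grad = 2.0 * sum(row[j] * r for row, r in zip(basis, resid))
            grad += 2.0 * smooth * (w[j] - prev[j])
            # Gradient step, then projection onto the box [0, 1].
            w[j] = min(1.0, max(0.0, w[j] - lr * grad))
    return w


def fit_sequence(basis, targets, smooth=0.1):
    """Fit frames in order, each one regularised towards its predecessor."""
    prev, out = [0.0] * len(basis[0]), []
    for target in targets:
        prev = fit_frame(basis, target, prev, smooth)
        out.append(prev)
    return out
```

The fixed step size `lr` only converges when it is small relative to the scale of `basis`; a production pipeline would more likely hand the same objective to a dedicated QP solver.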
Abstract
Despite extensive research, speech-driven 3D facial animation remains challenging because large-scale visual-audio datasets are scarce. Most prior work fits regression models to a small dataset by the method of least squares; such models struggle to generate diverse lip movements from speech and require substantial effort to refine their outputs. To address these issues, we propose SAiD, a diffusion-based approach to speech-driven 3D facial animation that uses a lightweight Transformer-based U-Net with a cross-modality alignment bias between the audio and visual modalities to enhance lip synchronization. Moreover, we introduce BlendVOCA, a benchmark dataset of paired speech audio and blendshape facial model parameters, to address the scarcity of public resources. Our experimental results demonstrate that the proposed approach achieves lip synchronization comparable or superior to the baselines, produces more diverse lip movements, and streamlines the animation editing process.
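The abstract does not define the alignment bias, so the sketch below shows only one common way such a bias can be realised, not necessarily the one used in SAiD: an additive term on cross-attention scores that lets each animation frame attend only to the audio features covering roughly the same time span. The function names, the `window` parameter, and the assumption that audio features and frames span the same duration are hypothetical.

```python
import math

# Illustrative sketch of an additive cross-attention bias between animation
# frames (queries) and audio features (keys). Names and windowing are assumed.

def alignment_bias(num_frames, num_audio, window=0):
    """Return a num_frames x num_audio matrix: 0.0 where attention is allowed,
    -inf elsewhere. Frame t may see the audio features overlapping frames
    t - window through t + window."""
    ratio = num_audio / num_frames  # audio features per animation frame
    bias = [[-math.inf] * num_audio for _ in range(num_frames)]
    for t in range(num_frames):
        lo = max(0, math.floor((t - window) * ratio))
        hi = min(num_audio, math.ceil((t + 1 + window) * ratio))
        for s in range(lo, hi):
            bias[t][s] = 0.0
    return bias


def biased_softmax(scores, bias_row):
    """Softmax over one row of attention scores after adding the bias."""
    shifted = [s + b for s, b in zip(scores, bias_row)]
    peak = max(shifted)
    exps = [math.exp(x - peak) for x in shifted]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


bias = alignment_bias(num_frames=4, num_audio=8)
weights = biased_softmax([0.0] * 8, bias[1])
```

With four frames and eight audio features, frame 1 puts all of its attention weight on audio features 2 and 3 and none on the rest, which is the locality that keeps lip motion tied to the sound produced at the same moment.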
