SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion
Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo Chen, Kai Li, Yu Meng
TL;DR
SayAnything is an end-to-end audio-driven lip synchronization framework built on Stable Video Diffusion that directly aligns lip motion with audio, without lip-expert supervision or intermediate representations. It introduces three conditioning modules (ID-Guider for identity, adaptive editing masking for region-specific control, and audio guidance with cross-attention) that fuse reference appearance, audio signals, and masked video in a unified denoising process. The approach reports strong visual fidelity, temporal coherence, and lip-sync quality across real and animated characters, with zero-shot generalization and competitive or better metrics on the HDTF and AVASpeech datasets. Because the design depends less on heavy priors, it can generate across diverse styles while maintaining identity, texture, and motion realism, which makes it promising for dubbing, virtual avatars, and animated content creation.
Abstract
Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis. This leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we propose three specialized modules: an identity preservation module, an audio guidance module, and an editing control module. Our design balances the different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate representations. Extensive experiments demonstrate that SayAnything generates highly realistic videos with improved lip-teeth coherence, enabling unseen characters to say anything, while generalizing effectively to animated characters.
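To make the audio guidance idea more concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention, in which visual latent tokens act as queries and audio feature tokens act as keys and values. It is an illustration under stated assumptions, not the SayAnything implementation: the function names, the toy token values, the omission of learned projection matrices, and the residual fusion step are all hypothetical choices made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: visual latent tokens (list of vectors).
    keys, values: audio feature tokens (lists of vectors).
    Assumption: no learned Q/K/V projections, to keep the sketch short.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity between this latent token and every audio token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the audio values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return outputs


# Toy inputs (hypothetical): three latent tokens, two audio tokens.
latent_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[2.0, 0.0], [0.0, 2.0]]

attended = cross_attention(latent_tokens, audio_tokens, audio_tokens)

# Residual fusion: add the attended audio context back onto each latent token.
fused = [
    [x + a for x, a in zip(token, context)]
    for token, context in zip(latent_tokens, attended)
]
```

In a diffusion backbone, a layer of this kind would typically sit inside the denoising network and use learned projections and multiple heads. The sketch only shows how each visual token draws a weighted mixture of audio features.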
