SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion

Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo Chen, Kai Li, Yu Meng

TL;DR

SayAnything presents an end-to-end audio-driven lip synchronization framework built on Stable Video Diffusion that directly aligns lip motion with audio without lip-expert supervision or intermediate representations. It introduces three conditioning modules—ID-Guider for identity, adaptive editing masking for region-specific control, and audio guidance with cross-attention—to fuse reference appearance, audio signals, and masked video in a unified denoising process. The approach achieves superior visual fidelity, temporal coherence, and lip-sync quality across real and animated characters, with strong zero-shot generalization and competitive or superior metrics on HDTF and AVASpeech datasets. Its design reduces dependencies on heavy priors, enabling flexible generation across diverse styles while maintaining identity, texture, and motion realism, making it promising for dubbing, virtual avatars, and animated content creation.

Abstract

Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis. This leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we propose three specialized modules: an identity preservation module, an audio guidance module, and an editing control module. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate representations. Extensive experiments demonstrate that SayAnything generates highly realistic videos with improved lip-teeth coherence, enabling unseen characters to say anything, while effectively generalizing to animated characters.

Paper Structure

This paper contains 33 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: (a) Overview of SayAnything architecture for lip synchronization. The denoising UNet takes noisy latents as input, concatenated with video latents obtained from masked video through VAE encoding. The reference image is processed by ID-Guider to produce multi-scale ID features, which are injected as residual signals into the denoising UNet. Audio features from Whisper are fused through cross-attention layers in the denoising process. (b) A typical UNet block, consisting of ResNet block, Self Attention, Audio Cross Attention, and Temporal Attention.
  • Figure 2: Our adaptive masking strategy first determines the initial mask through detected landmarks, then obtains the final mask through expansion and smoothing, effectively preventing motion leakage.
  • Figure 3: Qualitative comparisons with SOTA diffusion-based lip-sync methods [mukhopadhyay2024diff2lip, zhang2024musetalk, li2024latentsync, cheng2022videoretalking]. The first row shows the original input video. The second row shows the video from which the input audio was extracted; it can be regarded as the target lip movements. Rows 3-7 display the lip-synced videos. (a) Two cases in the cross-gender and cross-identity generation setting. (b) Two cases in the animated setting. Our method generates visual features that are more faithful to the animated driving characters, while other methods tend to generate spurious features that look more photorealistic.
  • Figure 4: Qualitative comparison of lip motion dynamics and tooth rendering. Our method demonstrates clearer and more consistent teeth as well as more flexible lip movements.
  • Figure 5: Ablation studies of our components in SayAnything. Video Fusion and VAE Feature significantly enhance reference image influence, limiting the range of lip movements. Larger fixed masks lead to colour shifts in masked regions and unnatural lip motions. Removing the condition masking strategy reduces visual quality. Zoom in for generated details.
  • ...and 2 more figures
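
The adaptive masking strategy described in the Figure 2 caption (initial mask from landmarks, then expansion, then smoothing) can be illustrated with a minimal pure-Python sketch. The caption does not specify the operators, so the choices here are assumptions made for illustration only: the initial mask is the bounding box of integer landmark coordinates, expansion is a square-window dilation, and smoothing is a box blur. The function names and the `expand_px` and `smooth_px` parameters are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = List[List[float]]


def initial_mask(landmarks: Sequence[Tuple[int, int]], height: int, width: int) -> Mask:
    """Binary mask over the bounding box of (x, y) landmarks (assumed initial shape)."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
    y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
    return [
        [1.0 if (y0 <= r <= y1 and x0 <= c <= x1) else 0.0 for c in range(width)]
        for r in range(height)
    ]


def _window(mask: Mask, r: int, c: int, radius: int) -> List[float]:
    """Values in the square neighborhood of (r, c), clipped at the image border."""
    h, w = len(mask), len(mask[0])
    return [
        mask[rr][cc]
        for rr in range(max(0, r - radius), min(h, r + radius + 1))
        for cc in range(max(0, c - radius), min(w, c + radius + 1))
    ]


def expand(mask: Mask, radius: int) -> Mask:
    """Dilation: each pixel takes the maximum of its neighborhood."""
    return [
        [max(_window(mask, r, c, radius)) for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


def smooth(mask: Mask, radius: int) -> Mask:
    """Box blur: each pixel takes the mean of its neighborhood, softening the edge."""
    out: Mask = []
    for r in range(len(mask)):
        row = []
        for c in range(len(mask[0])):
            vals = _window(mask, r, c, radius)
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def adaptive_mask(
    landmarks: Sequence[Tuple[int, int]],
    height: int,
    width: int,
    expand_px: int = 1,
    smooth_px: int = 1,
) -> Mask:
    """Landmarks -> initial mask -> expansion -> smoothing, in the order the caption gives."""
    return smooth(expand(initial_mask(landmarks, height, width), expand_px), smooth_px)


if __name__ == "__main__":
    m = adaptive_mask([(3, 3), (4, 4)], height=8, width=8)
    for row in m:
        print(" ".join(f"{v:.2f}" for v in row))
```

In this sketch the interior of the mouth region stays fully masked, the boundary takes fractional values, and far-away pixels stay unmasked. A practical implementation would more likely use an image library's dilation and Gaussian blur on the polygon traced by the detected lip landmarks.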