KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
TL;DR
KeyVID addresses the challenge of achieving precise audio-visual synchronization in diffusion-based video generation at low frame rates by learning to focus on salient motion moments. It introduces a three-stage, keyframe-aware pipeline: (1) a keyframe localizer that derives keyframe time steps from audio via an optical-flow-inspired motion score, (2) an audio-conditioned keyframe generator that creates sparse keyframes conditioned on the first frame and multi-modal inputs, and (3) a motion interpolator that fills in the frames between keyframes using diffusion with frame-conditioned guidance. The method combines frame index embeddings with keyframe-aligned audio, image, and text features through cross-attention, and uses FreeNoise to produce dense final videos in a single pass. Extensive experiments on AVSync15, Greatest Hits, and Landscapes show superior audio-visual synchronization and visual quality. Ablations confirm the importance of keyframe-based sampling, frame indexing, and first-frame conditioning, and the model also generalizes well to open-domain inputs.
Abstract
Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use frames sampled uniformly from video clips. However, at low frame rates, uniformly sampled frames fail to capture significant key moments in dramatic motions, and directly increasing the number of frames requires significantly more memory. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computational efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-visual synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is available at https://github.com/XingruiWang/KeyVID.
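
To make the contrast between uniform and keyframe-aware sampling concrete, the following is a minimal sketch, not the released implementation. It assumes a per-frame motion score has already been predicted from the audio, and it replaces the learned keyframe localizer with simple peak picking. The names `select_keyframes` and `motion_score`, and the toy score values, are hypothetical and chosen only for illustration.

```python
import numpy as np


def select_keyframes(motion_score, num_keyframes):
    """Pick keyframe indices from a 1-D per-frame motion score.

    Hypothetical helper: simple peak picking stands in for a learned
    keyframe localizer. Frame 0 is always kept because the first frame
    is the conditioning image.
    """
    score = np.asarray(motion_score, dtype=float)
    n = score.shape[0]
    if not 1 <= num_keyframes <= n:
        raise ValueError("num_keyframes must be between 1 and the number of frames")

    chosen = {0}

    # Interior local maxima: strictly above the left neighbour and
    # at least as high as the right neighbour.
    interior = np.arange(1, n - 1)
    is_peak = (score[1:-1] > score[:-2]) & (score[1:-1] >= score[2:])
    peaks = interior[is_peak]

    # Add the strongest peaks first.
    for idx in peaks[np.argsort(-score[peaks], kind="stable")]:
        if len(chosen) >= num_keyframes:
            break
        chosen.add(int(idx))

    # If there are too few peaks, fill up with the highest-scoring remaining frames.
    for idx in np.argsort(-score, kind="stable"):
        if len(chosen) >= num_keyframes:
            break
        chosen.add(int(idx))

    return sorted(chosen)


# Toy motion score over 12 frames with three short bursts of motion.
motion_score = [0.0, 0.1, 0.9, 0.2, 0.1, 0.1, 0.8, 0.1, 0.0, 0.3, 0.1, 0.0]

keyframes = select_keyframes(motion_score, num_keyframes=4)
uniform = np.linspace(0, len(motion_score) - 1, 4).round().astype(int).tolist()

print("keyframe-aware:", keyframes)
print("uniform:       ", uniform)
```

On this toy signal, the uniformly spaced indices land between the bursts, while the peak-based selection keeps the first frame plus the three frames where the score is locally highest. A motion interpolator would then be responsible for filling in the frames between the selected indices.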
