KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum

TL;DR

KeyVID addresses the challenge of achieving precise audio-visual synchronization in diffusion-based video generation at low frame rates by learning to focus on salient motion moments. It introduces a three-stage, keyframe-aware pipeline: (1) a keyframe localizer that derives time steps from audio via an optical-flow-inspired motion score, (2) an audio-conditioned keyframe generator that creates sparse keyframes conditioned on the first frame and multi-modal inputs, and (3) a motion interpolator that fills in non-keyframe frames using diffusion with frame-conditioned guidance. The method leverages frame index embeddings and keyframe-aligned audio, image, and text features with cross-attention, and uses FreeNoise to produce dense final videos in a single pass. Extensive experiments on AVSync15, Greatest Hits, and Landscapes show superior audio-visual synchronization and visual quality, with ablations confirming the importance of keyframe-based sampling, frame indexing, and first-frame conditioning, plus strong open-domain generalization capabilities.

Abstract

Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when the number of frames is increased directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computational efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is available at https://github.com/XingruiWang/KeyVID.
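
The order of the three stages described in the abstract can be sketched as a plain function composition. This is only a minimal illustration of the data flow, not the released implementation: the names `animate`, `localize`, `generate`, and `interpolate`, their signatures, and the `Frame` type are all hypothetical, and each stage is passed in as a callable so that no model internals are assumed.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image; a real system would use tensors.
Frame = List[float]


def animate(
    first_frame: Frame,
    audio: Sequence[float],
    num_frames: int,
    localize: Callable[[Sequence[float], int], List[int]],
    generate: Callable[[Frame, Sequence[float], List[int]], List[Frame]],
    interpolate: Callable[[List[int], List[Frame], int], List[Frame]],
) -> List[Frame]:
    """Run the three stages in the order the abstract describes."""
    # Stage 1: localize keyframe time steps from the audio alone.
    key_indices = localize(audio, num_frames)
    # Stage 2: generate one visual keyframe per localized time step,
    # conditioned on the input image and the audio.
    keyframes = generate(first_frame, audio, key_indices)
    # Stage 3: fill in all remaining frames between the keyframes.
    return interpolate(key_indices, keyframes, num_frames)
```

Any three callables with matching shapes can be plugged in, which makes it possible to swap out one stage (for example, the localizer) while keeping the others fixed.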

Paper Structure

This paper contains 26 sections, 9 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) Uniform frames vs. keyframes. Top: Uniformly sampled sparse frames, which fail to capture the key moments evident in the corresponding audio (Middle). Bottom: Keyframes precisely aligned with the hammer striking down, matching the critical moments in the audio waveform. (b) KeyVID video generation pipeline. KeyVID first detects keyframe time steps from the audio input with the keyframe localizer and then utilizes a keyframe generator to generate the corresponding visual keyframes. Intermediate frames are generated with the motion interpolator.
  • Figure 2: Motion score computation and prediction. (a) We compute the motion score of each frame as the average of its optical flow and localize keyframes at the peaks and valleys of the score. (b) The keyframe localizer is trained to predict motion scores from audio to identify keyframe locations.
  • Figure 3: Keyframe data selection and keyframe generator. (a) We select keyframes based on the local maxima and minima of the motion score. (b) The keyframe generator is trained to generate these sparse keyframes conditioned on the audio, first-frame image, text, and keyframe indices. These conditions are encoded and passed into the denoising U-Net. In each denoising U-Net block, the index embeddings are added to the video features and passed into a residual convolutional block (Res. Conv.). The following layers contain a spatial self-attention (SA) and a spatial cross-attention (CA) for each of the three conditional features. The output of each CA is followed by a gate with learnable weights $\lambda_1$ and $\lambda_2$. Please see details in \ref{sec:keyframe_gen}.
  • Figure 4: Qualitative comparison of KeyVID and baseline methods. We crop key motions on the audio waveform in (a) and the corresponding ground truth video in (b) as references and compare the generated video clips between models from (c) to (f). KeyVID with keyframe awareness (c) shows better alignment with motion peaks in audio signals—for example, the hammer striking, gunshots producing smoke, or facial movements when dogs bark or frogs croak.
  • Figure 5: RelSync scores across motion intensity levels. KeyVID improves the audio synchronization score at every motion intensity level.
  • ...and 2 more figures
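
The keyframe selection described in the captions of Figures 2(a) and 3(a) can be illustrated with a small sketch. It assumes the optical-flow magnitudes have already been computed and flattened to one list per frame, and it uses strict local extrema with no smoothing or minimum spacing; the function names and these simplifications are assumptions made for illustration, not the paper's procedure.

```python
from typing import List, Sequence


def motion_scores(flow_magnitudes: Sequence[Sequence[float]]) -> List[float]:
    """One score per frame: the mean optical-flow magnitude over its pixels."""
    return [sum(frame) / len(frame) for frame in flow_magnitudes]


def select_keyframes(scores: Sequence[float]) -> List[int]:
    """Indices of strict local maxima (peaks) and minima (valleys)."""
    keyframes = []
    # Endpoints have only one neighbor, so this sketch skips them.
    for t in range(1, len(scores) - 1):
        prev_s, cur_s, next_s = scores[t - 1], scores[t], scores[t + 1]
        is_peak = cur_s > prev_s and cur_s > next_s
        is_valley = cur_s < prev_s and cur_s < next_s
        if is_peak or is_valley:
            keyframes.append(t)
    return keyframes


# Toy motion-score curve with peaks at 1 and 5 and a valley at 3.
scores = [0.1, 0.5, 0.2, 0.1, 0.4, 0.9, 0.3]
print(select_keyframes(scores))  # [1, 3, 5]
```

At inference time no video is available, so, per Figure 2(b), the scores fed to a selection step like this would come from the keyframe localizer's audio-based prediction rather than from optical flow.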