Audio-Synchronized Visual Animation

Lin Zhang, Shentong Mo, Yijing Zhang, Pedro Morgado

TL;DR

This work tackles audio-guided, temporally synchronized visual animation by introducing ASVA, a task that animates static images into videos aligned with audio. It presents AVSync15, a high-quality audio-visual dataset curated for strong audio-visual synchronization across 15 classes, and AVSyncD, a diffusion-based model that augments a pre-trained latent diffusion network with time-aware audio tokens, temporal attention, and first-frame lookups to produce coherent, audio-synchronized motion. Extensive experiments establish AVSync15 as a robust benchmark and show that AVSyncD achieves state-of-the-art synchronization and animation quality; ablations confirm the importance of audio conditioning, temporal modeling, and careful data curation. The work also shows the approach's flexibility: it can animate without a base image and control the motion of specific objects in multi-object scenes, which points to broader, controllable audio-driven video generation.

Abstract

Current visual generation methods can produce high-quality videos guided by text. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. We introduce Audio-Synchronized Visual Animation (ASVA), a task that animates a static image to demonstrate motion dynamics, temporally guided by audio clips across multiple classes. To this end, we present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio-visual events across 15 categories. We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audio. Extensive evaluations validate AVSync15 as a reliable benchmark for synchronized generation and demonstrate our model's superior performance. We further explore AVSyncD's potential in a variety of audio-synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. We hope our established benchmark can open new avenues for controllable visual generation. More videos are available on the project webpage: https://lzhangbj.github.io/projects/asva/asva.html.

Paper Structure (16 sections, 5 equations, 6 figures, 2 tables)

This paper contains 16 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Given an audio clip and an image (green box), we produce animations that go beyond image stylization, with complex but natural dynamics synchronized with the audio at each frame. Results were produced by our AVSyncD model trained on the proposed AVSync15. Project webpage: https://lzhangbj.github.io/projects/asva/asva.html.
  • Figure 2: (a): Overview of the 15 categories in AVSync15. Categories are listed below the x-axis of the right plot. (b): Category-wise averages of the av-sync score $\phi$, IA, and IT on AVSync15 and on equivalently sized subsets of VGGSS and AVSync-AC. Error bars for VGGSS and AVSync-AC are obtained from 3 random splits.
  • Figure 3: AVSyncD overview. Left: We use ImageBind to encode audio into semantically aware, time-dependent tokens $({\bm{a}}_t)_{t=1}^{rT}$ and CLIP to encode the audio category into a text embedding ${\bm{\tau}}$. In addition, the model receives the latent of the first frame ${\bm{z}}_1$ and iteratively denoises the noisy latents of the subsequent frames ${\bm{z}}_{2:rT}^k$ via reverse diffusion. The denoising UNet, based on LDMs (Rombach et al., 2022), consists of a sequence of downsampling, bottleneck, and upsampling blocks, with the structure detailed on the right. Right: Anatomy of a UNet block for frame ${\bm{z}}_t$. The LDM's original spatial convolution, spatial attention, and text cross-attention layers are frozen, while its spatial self-attention layers are adapted into first-frame spatial attention layers that cross-attend to ${\bm{z}}_1$ instead. To learn video dynamics, we introduce temporal attention layers and first-frame lookup temporal convolutions applied to the input, output, and ResNet layers. We also train audio cross-attention layers for audio conditioning and synchronization. Trainable layers are marked in the figure.
  • Figure 4: Qualitative results on three datasets.
  • Figure 5: (a): Effects of audio amplitude vs. classifier-free audio guidance. Top: original audio with $\eta=1$; middle: $100\times$ amplified audio with $\eta=1$; bottom: original audio with $\eta=8$. (b): Animating generated images. (c): Animation with internet images and audio.
  • ...and 1 more figure
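
The captions for Figures 3 and 5 name two mechanisms that a small example may help make concrete: first-frame spatial attention (tokens of frame $t$ attend to the first frame ${\bm{z}}_1$) and classifier-free audio guidance with scale $\eta$. The following is a minimal pure-Python sketch, not the authors' implementation. It assumes the common classifier-free guidance form, eps_uncond + eta * (eps_audio - eps_uncond), which the captions do not spell out, and it omits the learned query/key/value projections a real attention layer would have. The function names `guided_noise` and `first_frame_attention` are hypothetical.

```python
import math


def guided_noise(eps_uncond, eps_audio, eta):
    """Assumed classifier-free guidance form (not confirmed by the source):
    move from the audio-free prediction toward the audio-conditioned one,
    scaled by eta. eta = 1 returns the conditioned prediction unchanged."""
    return [u + eta * (c - u) for u, c in zip(eps_uncond, eps_audio)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def first_frame_attention(queries_t, keys_1, values_1):
    """Scaled dot-product attention where queries come from frame t and
    keys/values come from frame 1 (learned projections omitted here)."""
    dim = len(keys_1[0])
    out = []
    for q in queries_t:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_1
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values_1))
            for c in range(len(values_1[0]))
        ])
    return out


# Toy noise predictions for a single latent with two entries.
eps_uncond = [0.0, 1.0]
eps_audio = [1.0, 3.0]
same_as_conditioned = guided_noise(eps_uncond, eps_audio, 1)
amplified = guided_noise(eps_uncond, eps_audio, 8)

# One query token from frame t attending over two tokens of frame 1.
attended = first_frame_attention(
    [[1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
)
```

Under this assumed form, raising $\eta$ amplifies the difference between the audio-conditioned and audio-free predictions rather than the audio signal itself, which is one way to read the contrast in Figure 5(a) between amplifying the audio and raising $\eta$.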