Listen and Move: Improving GANs Coherency in Agnostic Sound-to-Video Generation

Rafael Redondo

TL;DR

This work tackles the challenge of generating temporally coherent videos from agnostic audio signals using GANs. It introduces a triple sound routing scheme, a residual multi-scale DilatedRNN for extended audio analysis, and a directional ConvGRU-based video prediction layer to jointly improve frame fidelity and motion consistency. Across ablations and robustness tests, each component yields improvements in perceptual and temporal metrics and better resilience to audio distribution shifts. The approach advances generic sound-to-video generation with practical implications, though the authors note the computational cost and the ethical considerations inherent to realistic audiovisual synthesis.

Abstract

Deep generative models have demonstrated the ability to create realistic audiovisual content, sometimes driven by domains of a different nature. However, achieving smooth temporal dynamics in video generation remains a challenging problem. This work focuses on generic sound-to-video generation and proposes three main features to enhance both image quality and temporal coherency in generative adversarial models: a triple sound routing scheme, a multi-scale residual and dilated recurrent network for extended sound analysis, and a novel recurrent and directional convolutional layer for video prediction. Each of the proposed features improves, in both quality and coherency, the baseline neural architecture typically used in the state of the art, with the video prediction layer providing an extra temporal refinement.
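
The multi-scale residual and dilated recurrent sound encoder can be pictured as a stack of recurrent cells whose hidden states skip an increasing number of time steps, with a residual connection around each layer (see Figure 2 in the list below). The following is a minimal sketch under stated assumptions: the three-layer depth and equal input/output sizes come from the figure caption, while the GRU cell type, the 1/2/4 dilation schedule, and the exact residual placement are illustrative choices rather than the paper's confirmed configuration.

```python
import torch
import torch.nn as nn

class ResidualDilatedRNN(nn.Module):
    """Sketch of a residual multi-scale DilatedRNN motion encoder (cf. Figure 2).

    Layer l runs a recurrent cell whose hidden state is carried over a temporal
    dilation of 2**l steps; a residual connection adds each layer's input to its
    output, which is why the cells keep the same input and output size. The cell
    type and dilation schedule are assumptions made for illustration."""

    def __init__(self, size: int, num_layers: int = 3):
        super().__init__()
        self.cells = nn.ModuleList(nn.GRUCell(size, size) for _ in range(num_layers))
        self.dilations = [2 ** l for l in range(num_layers)]  # 1, 2, 4
        self.size = size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, size), e.g. audio features concatenated with noise
        B, T, _ = x.shape
        for cell, d in zip(self.cells, self.dilations):
            # Ring buffer holding the hidden state written d steps earlier
            hidden = [x.new_zeros(B, self.size) for _ in range(d)]
            outs = []
            for t in range(T):
                h = cell(x[:, t], hidden[t % d])  # recurrence skips d time steps
                hidden[t % d] = h
                outs.append(h)
            x = x + torch.stack(outs, dim=1)      # residual connection per layer
        return x  # one motion token per frame

# Usage: 16 frames of 128-d audio features -> 16 motion tokens of the same size
m = ResidualDilatedRNN(size=128)(torch.randn(4, 16, 128))
```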

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Main architecture. The training video is batched in consecutive $T$-frame sequences. Audio features $\boldsymbol{f}$ (LMS) are routed to: (1) a recurrent neural network $R$ as a source of motion, concatenated with noise $\boldsymbol{e}$; (2) the generator $G$ as a source of content, concatenated with motion tokens $\boldsymbol{m}$; and (3) the generative instance normalization layers. $R$ comprises a residual multi-scale DilatedRNN. $G$ contains Directional ConvRNNs. The video discriminator $D_V$ learns from reals $\boldsymbol{v}$, fakes $\boldsymbol{\tilde{v}}$, and shuffled versions $\boldsymbol{\hat{v}}$. The image discriminator $D_I$ receives real $v_t$ and synthetic $\tilde{v}_t$ frames with the same random index.
  • Figure 2: Motion encoding: a 3-layer DilatedRNN with residual connections (depicted only for the current time step). Note that the recurrent cells have the same input and output size.
  • Figure 3: Main building blocks, each made of a series of convolutional, normalization, and activation (LeakyReLU) layers. The generator uses audio-conditional instance normalization, noise injection, and residual connections. The discriminators instead use batch normalization, an extra temporal dimension (video), and skip connections (b) through convolutional layers (fRGB) that map the latent space to color space.
  • Figure 4: Video prediction layer: 4-directional and 1-centered ConvGRUs (kernel size 3). Spatial predictions are channel-wise concatenated into $X_t$ and blended by $1\!\times\!1$ convolutions to accommodate the output channels. Merged predictions $x'_t$ and previous hallucinated activations $x_t$ contribute to the output $x''_t$ according to an auto-regressive mask $a_t$ shared across channels via the Hadamard product $\odot$. Note that opposite directions share weights. A minimal sketch of this merge-and-blend step follows the figure list.
  • Figure 5: Illustration of artifacts produced by a $512\!\times\!512$ vanilla GAN with (left) skip-connections, (middle) residual connections, and (right) a residual generator and skip-connected discriminators.
  • ...and 1 more figure
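
The merge-and-blend step described for Figure 4 can be sketched as follows. The five ConvGRU branches (four directional, one centered, with opposite directions sharing weights), the channel-wise concatenation $X_t$, the $1\!\times\!1$ blend into $x'_t$, and the masked mix with the previous activations $x_t$ follow the caption; the rest is assumed for illustration: the directional spatial scanning is abstracted into plain ConvGRU cells, the mask $a_t$ is assumed to come from a $1\!\times\!1$ convolution with a sigmoid, and the mix is assumed to be $x''_t = a_t \odot x'_t + (1 - a_t) \odot x_t$.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell (kernel size 3). A stand-in for the paper's
    directional ConvGRUs; the per-direction spatial scan is not modeled here."""

    def __init__(self, ch: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)  # update and reset gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)       # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * n


class VideoPredictionLayer(nn.Module):
    """Sketch of the Figure 4 prediction layer: five ConvGRU branches (three
    distinct cells, since opposite directions share weights), channel-wise
    concatenation, a 1x1 blend, and an auto-regressive single-channel mask."""

    def __init__(self, ch: int):
        super().__init__()
        self.horizontal = ConvGRUCell(ch)      # shared by left and right scans
        self.vertical = ConvGRUCell(ch)        # shared by up and down scans
        self.center = ConvGRUCell(ch)
        self.blend = nn.Conv2d(5 * ch, ch, 1)  # merge concatenated predictions into x'_t
        self.mask = nn.Conv2d(ch, 1, 1)        # a_t, shared across channels (assumed form)

    def forward(self, x_t, h):
        # x_t: previous hallucinated activations (B, C, H, W)
        # h: dict of per-branch hidden states with the same shape as x_t
        preds = {
            "left": self.horizontal(x_t, h["left"]), "right": self.horizontal(x_t, h["right"]),
            "up": self.vertical(x_t, h["up"]), "down": self.vertical(x_t, h["down"]),
            "center": self.center(x_t, h["center"]),
        }
        X_t = torch.cat(list(preds.values()), dim=1)  # channel-wise concatenation
        x_merged = self.blend(X_t)                    # merged prediction x'_t
        a_t = torch.sigmoid(self.mask(x_merged))      # auto-regressive mask a_t in [0, 1]
        x_out = a_t * x_merged + (1 - a_t) * x_t      # x''_t = a_t * x'_t + (1 - a_t) * x_t
        return x_out, preds                           # carry preds as the next hidden states
```

In this reading, $a_t$ decides per spatial location how much of the new prediction replaces the previously hallucinated activation, which is one plausible way to realize the auto-regressive behavior the caption refers to.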