Tell What You Hear From What You See -- Video to Audio Generation Through Text

Xiulong Liu, Kun Su, Eli Shlizerman

TL;DR

VATT introduces a unified, text-conditioned video-to-audio generation framework that also produces audio captions. It merges a Video-to-Caption stage, which projects video features into an LLM embedding space to generate descriptive audio captions, with a Video+Text to Audio stage that uses a bi-directional transformer to generate discrete audio tokens conditioned on both visual cues and optional text prompts, decoded into waveforms by Encodec. The approach enables controllable generation via text and supports captioning, achieving competitive objective metrics and substantial gains when text guidance is provided (e.g., KLD as low as 1.41) while delivering favorable subjective quality. Experiments on VGGSound and AudioSet-2M show strong audio-text alignment (CLAP) and faster generation due to a masking-based parallel decoding scheme. This work advances multimodal generation by integrating LLM-based cross-modal encoding with efficient token-based audio generation and opens possibilities for text-driven, interactive video-audio creation and captioning, with considerations for safeguards and broader societal impacts.

Abstract

The content of visual and audio scenes is multi-faceted, such that a video can be paired with various audio and vice versa. Therefore, in the video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description of the audio. Such a framework has two advantages: i) the video-to-audio generation process can be refined and controlled via text, which complements the context of the visual information, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an LLM that is fine-tuned for instructions and includes a projection layer that maps video features to the LLM vector space; and VATT Audio, a transformer that generates audio tokens from visual frames and from an optional text prompt using iterative parallel decoding. The audio tokens are converted to a waveform by a pretrained neural codec. Experiments show that, compared to existing video-to-audio generation methods on objective metrics, VATT achieves competitive performance when the audio caption is not provided. When the audio caption is provided as a prompt, VATT achieves even more refined performance (lowest KLD score of 1.41). Furthermore, subjective studies show that audio generated by VATT Audio is preferred over audio generated by existing methods. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
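
The iterative parallel decoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes a MaskGIT-style procedure: start from a fully masked token sequence, predict all positions at once, commit the most confident predictions, and repeat on a cosine schedule. The abstract does not specify the schedule or the confidence criterion, so both are assumptions here, and `toy_predictor`, `MASK`, and `iterative_parallel_decode` are hypothetical names. The predictor is a random stand-in for the transformer, which in VATT would be conditioned on video and text features.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the transformer: one (token, confidence) guess per position.

    A real model would condition on video and text features; this toy draws
    random guesses so that the decoding loop is runnable.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence in at most num_steps parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, vocab_size, rng)
        # Cosine schedule (assumed): target number of positions left masked
        # after this step; it reaches 0 on the final step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / num_steps))
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


decoded = iterative_parallel_decode(seq_len=16, vocab_size=1024, num_steps=4)
```

Because every pass predicts all masked positions at once, the number of model calls is bounded by `num_steps` rather than by the sequence length, which is the usual reason this family of decoders is faster than token-by-token generation.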

Paper Structure

This paper contains 19 sections, 2 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: VATT is a flexible audio generative model capable of generating audio in two modes: i) When a silent video is the sole input, the model generates the audio along with a caption describing the possible audio that could match the video. ii) When a text prompt is provided in addition to the video, the model generates audio aligned with both the video and the given text prompt.
  • Figure 2: Two stages of the VATT training pipeline: (1) Video-to-Caption stage that maps video features into an audio caption through an LLM. (2) Video + Text to Audio stage that learns to generate audio tokens through masked token prediction conditioned on Stage (1) features.
  • Figure 3: Audio Tokens Decoder: VATT Audio is a bi-directional transformer that models the audio tokens and the conditioning inputs (LLM hidden states) jointly. We extract the part that corresponds to the audio features and apply $L$ linear layers in parallel to perform classification on masked tokens at each codebook layer.
  • Figure 4: Qualitative evaluation results: pairwise comparison of audio generated by VATT vs. other methods in terms of Fidelity and Relevance.
  • Figure 5: Qualitative samples that showcase text controllability: for the same video inputs, VATT is able to generate different sounds that align with the additional text prompts, showcasing its capability of performing controllable generation.
  • ...and 4 more figures
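
The classification step described in the Figure 3 caption can be sketched as follows, under stated assumptions: each time step of the audio part yields one hidden vector, and each of the $L$ codebook layers has its own linear layer that maps this vector to logits over that layer's vocabulary. The shapes, the bias-free heads, the argmax readout, and the names `make_head`, `linear`, and `classify_masked` are illustrative choices, not taken from the paper.

```python
import random


def make_head(hidden_dim, vocab_size, rng):
    """One bias-free linear layer stored as a vocab_size x hidden_dim matrix."""
    return [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
            for _ in range(vocab_size)]


def linear(head, h):
    """Logits for one hidden vector: one dot product per vocabulary entry."""
    return [sum(w * x for w, x in zip(row, h)) for row in head]


def classify_masked(hidden, heads, mask):
    """Predict a token for every masked (codebook layer, time step) slot.

    hidden: list of T hidden vectors taken from the audio part of the sequence.
    heads:  list of linear layers, one per codebook layer.
    mask:   grid of booleans (layers x T); True marks a token to predict.
    Returns a grid of the same shape holding a token id, or None if unmasked.
    """
    out = []
    for head, mask_row in zip(heads, mask):
        row = []
        for h, is_masked in zip(hidden, mask_row):
            if is_masked:
                logits = linear(head, h)
                row.append(max(range(len(logits)), key=logits.__getitem__))
            else:
                row.append(None)
        out.append(row)
    return out


rng = random.Random(0)
num_layers, num_steps, hidden_dim, vocab_size = 4, 6, 8, 16
heads = [make_head(hidden_dim, vocab_size, rng) for _ in range(num_layers)]
hidden = [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
          for _ in range(num_steps)]
mask = [[(t + l) % 2 == 0 for t in range(num_steps)] for l in range(num_layers)]
preds = classify_masked(hidden, heads, mask)
```

The heads share the same hidden vectors and do not depend on one another, so a tensor library could evaluate them in parallel; the nested loops here only keep the sketch dependency-free.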