Read, Watch and Scream! Sound Generation from Text and Video

Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee

TL;DR

ReWaS tackles the challenge of generating audio from open-world video while leveraging text prompts, by introducing a video-to-energy predictor that provides a time-varying structural cue to a robust text-to-audio diffusion model. The method combines a ControlNet-like energy adapter with AudioLDM, allowing continuous energy control and improved temporal alignment without heavy per-timestamp annotations. Empirical results on VGGSound and Greatest Hits show superior audio quality, stronger audiovisual alignment, and greater training efficiency compared to state-of-the-art baselines, with human studies corroborating improvements. This approach offers a practical, flexible framework for synchronized, multimodal sound synthesis with broad applicability in film, media, and interactive AI systems.

Abstract

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods generate high-quality audio but pose challenges in ensuring comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is much more efficient for training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Code and demo are available at https://naver-ai.github.io/rewas.
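
To make the notion of "energy" as a time-varying structural cue concrete, here is a minimal sketch of how a per-frame energy curve could be computed from a waveform. This section does not give the paper's exact definition, so everything here is an assumption for illustration: the function name `frame_energy`, the frame and hop sizes, and the choice of mean spectral magnitude per frame followed by moving-average smoothing.

```python
import numpy as np

def frame_energy(wave, frame_len=1024, hop=256, smooth=5):
    """Hypothetical energy curve: mean spectral magnitude per frame, smoothed.

    Assumption: this is an illustrative definition, not necessarily
    the one used by ReWaS.
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude spectrum of each frame, shape (n_frames, frame_len // 2 + 1).
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Average over frequency to get one scalar per frame.
    energy = mag.mean(axis=1)
    # Moving-average smoothing to obtain a continuous-looking curve.
    kernel = np.ones(smooth) / smooth
    return np.convolve(energy, kernel, mode="same")

# Usage: one second of silence followed by one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
wave = np.concatenate([np.zeros(sr), np.sin(2 * np.pi * 440 * t)])
energy = frame_energy(wave)
```

In this toy signal the curve stays at zero during the silent half and rises once the tone starts. A continuous signal of this kind, rather than discrete onset timestamps, is what a video-to-energy predictor would be trained to output.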

Paper Structure (24 sections, 19 figures, 6 tables)


Figures (19)

  • Figure 1: An example of audio generation requiring both text and video control. The text instruction "dog growling" is used for the text control. Video-to-audio (V2A) generation [im2wav] cannot capture the detailed semantics of the text (the dog is growling, not barking), while text-to-audio (T2A) generation [liu2023audioldm] cannot capture the content of the video (the dog is biting something) or its temporal alignment.
  • Figure 2: Discrete timestamp annotations vs. Continuous energy.
  • Figure 3: Energy can improve temporal alignment.
  • Figure 4: Overall architecture of ReWaS. Our model predicts energy control from a given video, and generates sound from the text prompt and the control condition. Red lines are used in training only, and are replaced by the video-to-energy estimator $\phi$ at test time.
  • Figure 5: Qualitative comparison on VGGSound. Surprisingly, when the skateboarder jumps, only ReWaS succeeded in detecting the short transition (yellow box). The text prompt is "skateboarding".
  • ...and 14 more figures