SonicVisionLM: Playing Sound with Vision Language Models

Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li

TL;DR

SonicVisionLM addresses the challenge of video-to-audio generation by decoupling visual understanding from audio synthesis and introducing time-controlled diffusion. By using a vision-language model to extract on-screen sound cues, a timestamp detector to pinpoint timing, and a time-conditioned latent diffusion model with an adapter for text-guided audio, the approach achieves highly synchronized, editable, and diverse sound generation for silent videos. The key contributions include the Visual-to-Audio Event Understanding module, the Sound Event Timestamp Detection module, the Audio Time-condition Embedding with a Time-controllable Adapter, and the CondPromptBank dataset for training, all leading to state-of-the-art results in both conditional and unconditional generation scenarios. The work has practical impact for video post-production, enabling automatic, user-tunable sound design that aligns closely with visuals while supporting off-screen ambience enhancements.

Abstract

There has been a growing interest in the task of generating sound for silent videos, primarily because of its practicality in streamlining video post-production. However, existing methods for video-sound generation attempt to directly create sound from visual representations, which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper, we present SonicVisionLM, a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models (VLMs). Instead of generating audio directly from video, we use the capabilities of powerful VLMs. When provided with a silent video, our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into the better-studied sub-problems of aligning image to text and text to audio through popular diffusion models. To improve the quality of audio recommendations with LLMs, we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio, enhancing synchronization with the visuals and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/

Paper Structure

This paper contains 18 sections, 10 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: The model automatically detects and generates on-screen sound and accepts the user's edits to text and timing for off-screen sound. On-screen sound refers to audio that originates from visible actions within the video frame. Off-screen sound comes from sources that are not directly observable on the screen.
  • Figure 2: SonicVisionLM's framework. SonicVisionLM presents a composite framework designed to automatically recognize on-screen sounds coupled with a user-interactive module for editing off-screen sounds. The blue dashed box and arrows in the figure represent the visual automation workflow: First, a silent video goes into the system to determine the occurring events (text) and their timing (time). Then, this information conditions the generation of sounds matching the screen. The purple dotted box and arrows show how users can modify or add off-screen sounds.
  • Figure 3: Sound Event Timestamp Detection Module. The network analyzes the video's features to output a binary vector corresponding to the video's frame count. Within this vector, sections marked in white (value of 1) mean sound presence, and those in black (value of 0) indicate sound absence.
  • Figure 4: Conditional Generation Task Qualitative Results. The red dashed boxes are the conditional audio inputs and generated results for CondFoleyGen, and the blue dashed boxes are the conditional text inputs and results corresponding to SonicVisionLM.
  • Figure 5: Unconditional Generation Task Qualitative Results. The left example is from CountixAV, and the right one is from Greatest Hits, shown side by side for comparison. The dashed boxes highlight examples of both good and bad generated results.
  • ...and 6 more figures
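
The Figure 3 caption describes the timestamp detector's output as a binary vector with one entry per video frame, where 1 means sound is present and 0 means it is absent. As a minimal sketch (not the authors' code), the following Python shows one way such a vector could be turned into time intervals for conditioning audio generation. The function name `mask_to_intervals`, the `fps` parameter, and the example mask are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def mask_to_intervals(mask: List[int], fps: float) -> List[Tuple[float, float]]:
    """Convert a per-frame 0/1 sound-presence mask into (start_s, end_s) intervals.

    Hypothetical helper: `mask` has one entry per video frame, and `fps` is the
    assumed frame rate used to map frame indices to seconds.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")

    intervals: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current run of 1s, if any

    for i, value in enumerate(mask):
        if value and start is None:
            # A run of sound-present frames begins here.
            start = i
        elif not value and start is not None:
            # The run ended on the previous frame; close the interval.
            intervals.append((start / fps, i / fps))
            start = None

    # Close a run that extends to the last frame.
    if start is not None:
        intervals.append((start / fps, len(mask) / fps))

    return intervals


# Example: 8 frames at an assumed 4 frames per second.
example_mask = [0, 1, 1, 0, 0, 1, 1, 1]
example_intervals = mask_to_intervals(example_mask, fps=4.0)
print(example_intervals)  # [(0.25, 0.75), (1.25, 2.0)]
```

Each interval is half-open in frame terms: it starts at the first frame marked 1 and ends at the first following frame marked 0 (or at the end of the video).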