Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Junwon Lee, Juhan Nam, Jiyoung Lee

TL;DR

SelVA tackles the challenge of generating only a user-specified sound from a multi-object video by turning text prompts into explicit selectors of audible semantics. It introduces a text-conditioned video encoder with a cross-attention mechanism and learnable [SUP] tokens to refine visual-text grounding, paired with a diffusion-based multimodal audio generator trained via a two-stage, self-augmented pipeline. A dedicated VGG-MonoAudio benchmark demonstrates state-of-the-art performance across audio quality, semantic alignment, and temporal synchronization, with human studies corroborating improvements over prior methods. The work advances practical, controllable V2A for multimedia production by enabling selective, source-specific audio synthesis without requiring costly per-source supervision.

Abstract

This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate a single source-mixed sound at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MonoAudio, a curated benchmark of clean single-source videos for this task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.

Paper Structure

This paper contains 50 sections, 8 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: SelVA turns text prompts into precise selectors of sound sources within a video. The text-conditioned video encoder extracts intent-focused video features that condition the generator to synthesize only the user-specified sound source (e.g., 'cat meowing' vs. 'dog barking').
  • Figure 2: The overall training pipeline of SelVA. We learn a text-conditioned video encoder in a teacher-student distillation manner (left; first stage), and train an audio generator that conditions on text and isolated visual cues for the sound source (right; second stage). Learnable layers (highlighted in red) and frozen layers (highlighted in blue) are marked with separate icons.
  • Figure 3: Attention visualization for [eos] token over auto-mixed frame in the last block without (left) / with (right) [SUP] tokens. Each subcaption denotes the corresponding target prompt.
  • Figure 4: Examples of selective generation with real-world videos. The white dotted curve is the root-mean-squared audio amplitude.
  • Figure 5: Human study results on VGG-MonoAudio. The GT results (i.e., real sound) show oracle performance. SelVA outperforms state-of-the-art methods, including MMAudio and VOS baselines.
  • ...and 9 more figures