VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song

TL;DR

VSSFlow proposes a unified, flow-based model for video-conditioned sound and speech generation that jointly handles V2S and VisualTTS within a single DiT-based architecture. By introducing a condition aggregation mechanism that leverages cross-attention for video signals and concatenation for deterministic transcript cues, it achieves strong performance across V2S and VisualTTS benchmarks without complex multi-stage training. The work demonstrates a mutual benefit from end-to-end joint training, driven by learning a shared audio prior that improves convergence and the robustness of classifier-free guidance. It also shows practical adaptability to joint sound-speech generation through fine-tuning on synthetic mixtures, underscoring the potential of unified generative models for multimodal media synthesis.

Abstract

Video-conditioned sound and speech generation, encompassing the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, is conventionally addressed as two separate tasks, with limited exploration of unifying them within a single framework. Recent attempts to unify V2S and VisualTTS struggle to handle distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which integrates both V2S and VisualTTS into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. VSSFlow therefore leverages these inductive biases to handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from end-to-end joint learning of sound and speech generation without any extra design of training stages. Detailed analysis attributes this to a learned general audio prior shared between the tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the potential of unified generative models.
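The condition aggregation idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, shapes, and single-head attention without learned projections are all hypothetical simplifications, and concatenating the transcript along the sequence axis is an assumption, since the text here does not state the axis. The sketch only shows the routing: transcript embeddings share the self-attention sequence with the audio latent, while video features enter as keys and values of a cross-attention step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention; q: (Tq, d), k and v: (Tk, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def aggregate_conditions(audio_latent, phoneme_emb, video_feat):
    """Hypothetical block: no learned projections, norms, or MLPs.

    audio_latent: (Ta, d) noisy audio latent frames
    phoneme_emb:  (Tp, d) transcript (phoneme) embeddings
    video_feat:   (Tv, d) per-frame video features
    """
    n_audio = audio_latent.shape[0]
    # Assumption: the transcript is concatenated along the sequence axis,
    # so self-attention sees audio and transcript tokens together.
    seq = np.concatenate([audio_latent, phoneme_emb], axis=0)
    seq = seq + attention(seq, seq, seq)
    # Video condition: queries come from the sequence, keys/values from video.
    seq = seq + attention(seq, video_feat, video_feat)
    # Keep only the audio positions as the block output.
    return seq[:n_audio]

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
phonemes = rng.standard_normal((5, 16))
video = rng.standard_normal((4, 16))
out = aggregate_conditions(audio, phonemes, video)
```

In this toy setup `out` has the same shape as `audio`; a real DiT block would add learned query, key, and value projections, normalization, timestep conditioning, and feed-forward layers.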

Paper Structure

This paper contains 40 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: VSSFlow: A unified generative model for video-conditioned sound and speech synthesis. (a) Video-to-Sound generation given a silent video. (b) Visual-TTS generation given a silent talking video and speech transcripts. (c) Sound-Speech Joint generation given a silent video and transcripts.
  • Figure 2: Overview of VSSFlow's architecture. VSSFlow employs cross-attention-based DiT blocks and a flow-matching paradigm, taking video CLIP representations and speech phoneme embeddings as conditional inputs. We conduct ablation studies on the different DiT condition mechanisms illustrated here (see the section on condition-mechanism ablations). The CrossV variant (introducing the video condition via cross-attention and the speech condition via concatenation) improves overall performance on both the V2S and VisualTTS tasks.
  • Figure 3: Performance comparison of different conditioning mechanisms over training epochs. (a) shows the FAD metric for the V2S task, while (b) presents the WER metric for the VisualTTS task. (c) visualizes the attention weights in the self- and cross-attention layers of the DiT blocks. More metrics can be found in the appendix (conditioning-mechanism experiments).
  • Figure 4: Impact of joint learning on the performance of sound and speech generation. The left three models are trained with different data settings: V2S only, V2S + VisualTTS, and V2S + TTS. (a.1) shows the FAD metric for the sound generation task across training steps. (a.2) compares the performance of the models (trained on the three data settings for 10k steps) under varying classifier-free guidance scales. The right two models are trained with VisualTTS data only and with V2S + VisualTTS data. (b) plots the WER and Speaker Similarity metrics for the VisualTTS task across training steps. More metrics can be found in the appendix (data-setting experiments).
  • Figure 5: Performance comparison of models with different conditioning mechanisms over training epochs. (a) The top-left three plots show metrics for the V2S task, while (b) the bottom-left three plots present metrics for the VisualTTS task. (c) Visualization of the attention weights in the self- and cross-attention layers of the DiT blocks.
  • ...and 2 more figures
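
Figure 4 (a.2) varies the classifier-free guidance scale. As a rough illustration of what that scale controls in a flow-based sampler, the sketch below combines conditional and unconditional velocity predictions with the standard guidance formula and integrates them with Euler steps. Everything here is an assumption for illustration: `toy_velocity` is a hypothetical stand-in for a trained network, the time convention (noise at t = 0, data at t = 1) is assumed, and the paper's actual sampler and guidance formulation may differ.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    # Standard classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one by `scale`.
    return v_uncond + scale * (v_cond - v_uncond)

def euler_sample(velocity_fn, x0, cond, scale, num_steps=10):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps.
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_c = velocity_fn(x, t, cond)   # conditional prediction
        v_u = velocity_fn(x, t, None)   # unconditional prediction
        x = x + dt * cfg_velocity(v_c, v_u, scale)
    return x

def toy_velocity(x, t, cond):
    # Hypothetical stand-in for a trained model: a straight-line flow toward
    # `cond` when it is given, and toward the origin when it is dropped.
    target = cond if cond is not None else np.zeros_like(x)
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)
target = np.array([1.0, -2.0, 0.5])
sample_w1 = euler_sample(toy_velocity, x0, target, scale=1.0)
sample_w0 = euler_sample(toy_velocity, x0, target, scale=0.0)
```

With this toy velocity field, a scale of 1 should land close to `target` and a scale of 0 close to the origin; scales above 1 push the sample further in the conditional direction, which is the knob varied in that comparison.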