Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, Meng Cao

TL;DR

This work tackles Text-to-Sounding-Video (T2SV) by addressing two core issues in dual-tower architectures: conditioning interference from shared captions and the design of cross-modal interaction. It introduces Hierarchical Visual-Grounded Captioning (HVGC) to produce modality-pure captions for video and audio, and BridgeDiT with Dual CrossAttention (DCA) to enable symmetric, bidirectional fusion between video and audio towers. Extensive experiments on AVSync15, VGGSound-SS, and Landscape show state-of-the-art performance on most metrics, corroborated by human evaluations that favor BridgeDiT. The results illuminate the importance of disentangled conditioning and bidirectional interaction for robust, temporally and semantically coherent T2SV generation, and suggest directions for data augmentation and RLHF-driven refinements.

Abstract

This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring that both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single shared caption, in which the same text conditions both video and audio, often creates modal interference and confuses the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions (a video caption and an audio caption), eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer that employs a Dual CrossAttention (DCA) mechanism. DCA acts as a robust "bridge" that enables a symmetric, bidirectional exchange of information between the towers, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for future T2SV research. All code and checkpoints will be publicly released.
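
To make the idea of a symmetric, bidirectional exchange concrete, the following is a minimal NumPy sketch of a dual cross-attention step. It is an illustration under stated assumptions, not the authors' implementation: the single-head layout, the residual connections, the token counts, the feature width, and every name (`cross_attention`, `dual_cross_attention`, `params_v`, `params_a`) are hypothetical, and the random matrices stand in for learned projections.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens attend to `context` tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def dual_cross_attention(video, audio, params_v, params_a):
    """Symmetric exchange: each tower queries the other tower's features.

    Both directions read the pre-fusion inputs, so neither modality is
    privileged (an assumption of this sketch).
    """
    video_out = video + cross_attention(video, audio, *params_v)
    audio_out = audio + cross_attention(audio, video, *params_a)
    return video_out, audio_out


rng = np.random.default_rng(0)
dim = 8                                  # hypothetical shared feature width
video = rng.standard_normal((6, dim))    # 6 hypothetical video tokens
audio = rng.standard_normal((4, dim))    # 4 hypothetical audio tokens
# Random stand-ins for the learned query/key/value projections of each direction.
params_v = tuple(rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
params_a = tuple(rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))

video_out, audio_out = dual_cross_attention(video, audio, params_v, params_a)
print(video_out.shape, audio_out.shape)  # (6, 8) (4, 8)
```

Each tower keeps its own sequence length, so the fusion step can sit between existing video and audio layers without reshaping either stream.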

Paper Structure

This paper contains 42 sections, 13 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Examples of sounding videos generated by our BridgeDiT model, showcasing high quality, temporal synchronization, and text alignment. Our method generates high-fidelity video frames and detailed audio spectrograms that remain faithful to the given text prompts. Critically, as highlighted in the dashed boxes, the generated audio and video are precisely synchronized, demonstrating strong temporal coherence between visual events and their corresponding sounds. More cases are shown in the anonymous demo page https://bridgedit-t2sv.github.io.
  • Figure 2: Our three-stage Hierarchical Visual-Grounded Captioning (HVGC) framework generates disentangled modality-pure text captions. First, a Vision-Language Large Model (VLLM) produces a detailed video caption ($T_V$). Subsequently, a Large Language Model (LLM) extracts relevant audio tags from this video caption. Finally, the framework leverages both the visual context in $T_V$ and the extracted audio tags to generate a pure audio caption ($T_A$).
  • Figure 2: Performance on VGGSound-SS and Landscape. AV denotes AV-Align metric here. Best and second-best are highlighted.
  • Figure 3: The BridgeDiT Architecture. (a) The overall dual-tower architecture: parallel video and audio DiT streams are connected by our proposed BridgeDiT Block at specific layers. (b)-(d) Details of the fusion strategies within the block: our proposed Dual Cross-Attention (b), alongside the Full-Attention (c) and Additive Fusion (d) baselines.
  • Figure 4: Comparing different fusion mechanisms. Our DCA fusion mechanism outperforms all other baselines in both AV-Align and VA-IB Score.
  • ...and 5 more figures
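
The three-stage HVGC flow described in the first Figure 2 caption can be sketched as a small pipeline. This is only an outline under assumptions: the three callables stand in for the VLLM and LLM calls, and their names, signatures, and the toy stub outputs are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionPair:
    video_caption: str      # T_V
    audio_tags: List[str]   # intermediate tags extracted from T_V
    audio_caption: str      # T_A


def hvgc(
    video_path: str,
    describe_video: Callable[[str], str],
    extract_audio_tags: Callable[[str], List[str]],
    write_audio_caption: Callable[[str, List[str]], str],
) -> CaptionPair:
    # Stage 1: a vision-language model produces the detailed video caption T_V.
    t_v = describe_video(video_path)
    # Stage 2: a language model extracts audio tags from T_V.
    tags = extract_audio_tags(t_v)
    # Stage 3: T_V and the tags together yield the audio caption T_A.
    t_a = write_audio_caption(t_v, tags)
    return CaptionPair(t_v, tags, t_a)


# Toy stubs so the sketch runs without any model; real calls would go here.
def stub_vllm(path: str) -> str:
    return "A dog barks at a passing car on a quiet street."


def stub_tagger(t_v: str) -> List[str]:
    sounding_words = ["barks", "car"]  # hypothetical tag vocabulary
    return [w for w in sounding_words if w in t_v]


def stub_audio_writer(t_v: str, tags: List[str]) -> str:
    return "Sound of: " + ", ".join(tags)


pair = hvgc("clip.mp4", stub_vllm, stub_tagger, stub_audio_writer)
print(pair.audio_caption)  # Sound of: barks, car
```

Passing the models in as callables keeps the stage order explicit: the audio caption is always derived from the visually grounded caption, never from the raw video directly.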