Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan; Xihua Wang; Zhengfeng Lai; Xin Cheng; Peng Zhang; XiaoJiang Liu; Ruihua Song; Meng Cao

TL;DR

This work tackles Text-to-Sounding-Video (T2SV) generation by addressing two core issues in dual-tower architectures: conditioning interference caused by a shared caption, and the design of cross-modal interaction. It introduces Hierarchical Visual-Grounded Captioning (HVGC) to produce modality-pure captions for video and audio, and BridgeDiT with Dual CrossAttention (DCA) to enable symmetric, bidirectional fusion between the video and audio towers. Extensive experiments on AVSync15, VGGSound-SS, and Landscape show state-of-the-art performance on most metrics, corroborated by human evaluations that favor BridgeDiT. The results highlight the importance of disentangled conditioning and bidirectional interaction for temporally and semantically coherent T2SV generation, and suggest directions for data augmentation and RLHF-driven refinements.

Abstract

This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring that both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single shared caption, in which the text for video is identical to the text for audio, often creates modal interference that confuses the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions (a video caption and an audio caption), eliminating interference at the conditioning stage. Building on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer that employs a Dual CrossAttention (DCA) mechanism acting as a robust "bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for future work on T2SV. All code and checkpoints will be publicly released.
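To make the idea of a symmetric, bidirectional exchange concrete, the following is a minimal sketch of cross-attention applied in both directions between two token sequences. It is an illustration under simplifying assumptions, not the BridgeDiT implementation: it uses a single head with identity query/key/value projections (no learned weights), a plain residual update, and the hypothetical function names `cross_attend` and `dual_cross_attention`. The abstract does not specify these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over all context tokens.

    Assumption: identity projections, so keys and values are the context
    tokens themselves. A real layer would use learned Q/K/V matrices.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def dual_cross_attention(video_tokens, audio_tokens):
    """Symmetric exchange: video attends to audio and audio attends to video.

    Both directions read the same pre-update inputs, so neither modality
    is privileged. Updates are added residually (an assumption).
    """
    video_update = cross_attend(video_tokens, audio_tokens)
    audio_update = cross_attend(audio_tokens, video_tokens)
    new_video = [
        [x + u for x, u in zip(tok, upd)]
        for tok, upd in zip(video_tokens, video_update)
    ]
    new_audio = [
        [x + u for x, u in zip(tok, upd)]
        for tok, upd in zip(audio_tokens, audio_update)
    ]
    return new_video, new_audio


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy video tokens
    audio = [[0.5, 0.5], [1.0, 0.0]]              # 2 toy audio tokens
    fused_video, fused_audio = dual_cross_attention(video, audio)
    print(len(fused_video), len(fused_audio))
```

Each tower keeps its own sequence length and feature width; only the content of the tokens changes, which is what lets two separately pretrained backbones exchange information without being merged into one.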
