Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang, Yunhai Tong
TL;DR
This work systematically investigates whether audio-video joint denoising benefits video generation beyond synchrony. It introduces AVFullDiT, a parameter-efficient architecture that reuses pre-trained T2V and T2A backbones, with AVFull-Attention and AVSyncRoPE enabling joint denoising and cross-modal learning. Across two diverse datasets, the T2AV model consistently improves video quality, motion realism, and physical commonsense, even when only video metrics are of interest. Ablation studies support the core architectural choices and highlight the privileged role of audio in shaping grounded, causally consistent world dynamics. Overall, the findings support the premise that incorporating audio signals strengthens multimodal world models for video generation, and they point to future directions in unified perception-and-generation systems.
Abstract
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large motions and object-contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
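
To make the idea of joint denoising over both modalities more concrete, the following is a minimal sketch of one plausible reading of "full attention" across audio and video tokens: the two token sequences are concatenated and a single self-attention pass lets every token attend to every other, regardless of modality. This is an illustration under stated assumptions, not the AVFullDiT implementation. The function name `joint_full_attention`, the shared single-head projections, the token counts, and the omission of any positional encoding (including AVSyncRoPE) are all hypothetical simplifications that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_full_attention(video_tokens, audio_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated video and audio tokens.

    Hypothetical simplification: both modalities share one set of
    projections and no positional encoding is applied.
    """
    # Stack both modalities into one sequence of shape (Nv + Na, d).
    tokens = np.concatenate([video_tokens, audio_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product scores between every pair of tokens.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    out = weights @ v
    # Split the result back into per-modality streams.
    n_video = video_tokens.shape[0]
    return out[:n_video], out[n_video:], weights


# Toy inputs: 6 video tokens and 4 audio tokens of width 8 (arbitrary sizes).
rng = np.random.default_rng(0)
d = 8
video = rng.normal(size=(6, d))
audio = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

video_out, audio_out, weights = joint_full_attention(video, audio, w_q, w_k, w_v)
```

In this sketch the block `weights[:6, 6:]` holds the attention that video tokens place on audio tokens. That cross-modal pathway is what a T2V-only counterpart would lack.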
