Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang, Yunhai Tong

TL;DR

This work systematically investigates whether audio–video joint denoising benefits video generation beyond synchrony. It introduces AVFullDiT, a parameter-efficient architecture that reuses pre-trained T2V and T2A backbones with AVFull-Attention and AVSyncRoPE to enable joint denoising and cross-modal learning. Across two diverse datasets, the T2AV model consistently improves video quality, motion realism, and physical commonsense, even when video metrics alone are the primary focus. Ablation studies confirm the core architectural choices and highlight the privileged role of audio in shaping grounded, causally consistent world dynamics. Overall, the findings support the premise that incorporating audio signals strengthens multimodal world models for video generation, and they point to future directions in unified perception-and-generation systems.

Abstract

Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large motions and object-contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\rightarrow$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

Paper Structure

This paper contains 20 sections, 12 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Conceptual comparison between T2V and T2AV diagrams. (a) T2V training denoises video latents with video-only supervision. (b) In inference, T2V may misinterpret motion because there is insufficient evidence linking visual appearances to world physics. (c) T2AV training jointly denoises audio and video latents with audio–video supervision. (d) In inference, T2AV produces physically plausible motion with synchronized audio. Audio helps video generation models understand the world.
  • Figure 2: The architecture of AVFullDiT and Audio-Video Full Attention. (a) AVFullDiT reuses pre-trained T2V/T2A early towers and stacks joint blocks that predict video/audio velocities under a unified flow-matching loss. (b) AVFull-Attention performs symmetric MHSA over the concatenated audio–video token sequence using the video width as the joint dimension; audio projections are expanded with small adapter matrices. The attended sequence is split and projected back per modality.
  • Figure 3: The architecture of AVSyncRoPE. (a–b) The vanilla RoPEs of the pre-trained video and audio DiTs live on different temporal scales. (c) We rescale audio positions so that video and audio tokens are aligned in real time. This improves video learning, with the side benefit of tighter A/V synchrony.
  • Figure 4: Example of incorrect video prompt annotations in TheGreatestHits. The bolded parts of the video prompts are incorrect, while the audio prompts indicate the correct generation result for T2AV.
  • Figure 5: Validation loss comparison between T2AV and T2V.
  • ...and 3 more figures
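
The captions of Figures 2 and 3 describe two mechanisms that a small example may make easier to picture: rescaling audio positions onto the video clock (AVSyncRoPE) and running one self-attention over the concatenated audio–video tokens at the video width (AVFull-Attention). The NumPy sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It uses a single attention head, rotary embedding along the temporal axis only, and token rates given in tokens per second. All function names, the `up`/`down` adapter matrices, and the chosen sizes and rates are hypothetical.

```python
import numpy as np

def synced_positions(n_video, n_audio, video_rate, audio_rate):
    """Temporal positions on a shared clock, measured in video-token steps.

    Assumption: rates are tokens per second. Audio indices are scaled by
    video_rate / audio_rate so that equal positions mean equal real time.
    """
    video_pos = np.arange(n_video, dtype=np.float64)
    audio_pos = np.arange(n_audio, dtype=np.float64) * (video_rate / audio_rate)
    return video_pos, audio_pos

def rope_angles(positions, dim, base=10000.0):
    """Standard 1-D RoPE angles, shape (tokens, dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2, dtype=np.float64) / dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x (tokens, dim) by the given angles."""
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def av_full_attention(video, audio, w_v, w_a, up, down, v_ang, a_ang):
    """Single-head self-attention over concatenated video and audio tokens.

    w_v / w_a hold per-modality 'q', 'k', 'v', 'o' projections (hypothetical
    layout). `up` expands audio projections to the video width; `down` maps
    the attended audio tokens back to the audio width.
    """
    qv, kv, vv = (video @ w_v[n] for n in "qkv")
    qa, ka, va = (audio @ w_a[n] @ up for n in "qkv")
    q = np.concatenate([apply_rope(qv, v_ang), apply_rope(qa, a_ang)])
    k = np.concatenate([apply_rope(kv, v_ang), apply_rope(ka, a_ang)])
    v = np.concatenate([vv, va])
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v
    # Split the attended sequence and project back per modality.
    out_v, out_a = out[:len(video)], out[len(video):]
    return out_v @ w_v["o"], out_a @ down @ w_a["o"]

# Toy sizes and rates, chosen only for illustration.
rng = np.random.default_rng(0)
n_v, n_a, d_v, d_a = 4, 8, 16, 8
video = rng.standard_normal((n_v, d_v))
audio = rng.standard_normal((n_a, d_a))
w_v = {n: rng.standard_normal((d_v, d_v)) / np.sqrt(d_v) for n in "qkvo"}
w_a = {n: rng.standard_normal((d_a, d_a)) / np.sqrt(d_a) for n in "qkvo"}
up = rng.standard_normal((d_a, d_v)) / np.sqrt(d_a)
down = rng.standard_normal((d_v, d_a)) / np.sqrt(d_v)

v_pos, a_pos = synced_positions(n_v, n_a, video_rate=4.0, audio_rate=8.0)
out_v, out_a = av_full_attention(
    video, audio, w_v, w_a, up, down,
    rope_angles(v_pos, d_v), rope_angles(a_pos, d_v),
)
print(out_v.shape, out_a.shape)
```

With these toy rates, every second audio token lands on the same position as a video token, so tokens that occur at the same moment receive the same temporal rotation. A real video DiT typically also encodes spatial position, which this sketch omits.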