In-Context Audio Control of Video Diffusion Transformers

Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue

TL;DR

ICAC addresses the problem of injecting a time-synchronous audio signal into a unified in-context video diffusion transformer. The paper systematically compares 2D cross-attention, 2D self-attention, and fully unified 3D self-attention, and introduces Masked 3D Attention with an efficient Flash Attention implementation, which yields stable convergence and strong lip synchronization for video generated from an audio stream and reference images. A two-stage training curriculum for the 3D-attention variant and experiments on Celeb-V and MochaBench indicate that deeper audio integration gives better results while training remains stable, matching or surpassing larger, specialized baselines. The outcome is high-fidelity, audio-driven talking-head generation within a unified transformer framework.

Abstract

Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: The overall architecture of ICAC. It takes text, audio, image, and noise as inputs. The text is encoded by a text encoder, the audio by Wav2Vec2, and the image by a VAE. These encoded conditional inputs, along with the noise, are fed into a DiT model. Finally, the output of the DiT is decoded by a 3D VAE to generate the final video.
  • Figure 2: Conceptual visualization of the compared attention configurations: (a) 2D cross-attention, (b) 2D self-attention without updating audio features, (c) 2D self-attention, and (d) (masked) 3D attention. Sub-figure (e), to the right of (d), shows the mask pattern of masked 3D attention, consisting of video-to-video attention (full), audio-to-audio attention (blocked), video-to-audio attention (blocked), and audio-to-video attention (blocked). Sub-figure (f) shows the RoPE indices.
  • Figure 3: Generated results from Celeb-V (the first three rows) and MochaBench (the rest).
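
The mask pattern described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes that "blocked" means attention is restricted to tokens belonging to the same frame index, that every frame has a fixed number of video and audio tokens, and that the sequence is laid out as all video tokens followed by all audio tokens. The function name and its parameters are hypothetical.

```python
def masked_3d_attention_mask(num_frames, video_per_frame, audio_per_frame):
    """Return a boolean matrix over the [video, audio] token sequence.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (not confirmed by the source): "blocked" attention means
    a token only sees tokens that share its frame index.
    """
    # Tag every token with its modality and the frame it belongs to.
    video = [("video", t) for t in range(num_frames) for _ in range(video_per_frame)]
    audio = [("audio", t) for t in range(num_frames) for _ in range(audio_per_frame)]
    tokens = video + audio

    def allowed(query, key):
        # Video-to-video attention is full.
        if query[0] == "video" and key[0] == "video":
            return True
        # Audio-to-audio, video-to-audio, and audio-to-video are
        # restricted to temporally aligned (same-frame) tokens.
        return query[1] == key[1]

    return [[allowed(q, k) for k in tokens] for q in tokens]


if __name__ == "__main__":
    # Two frames, three video tokens and one audio token per frame.
    mask = masked_3d_attention_mask(2, 3, 1)
    for row in mask:
        print("".join("x" if cell else "." for cell in row))
```

Under these assumptions the video-to-video quadrant is dense, while the three quadrants involving audio are block-structured along the time axis. A matrix like this would still need to be converted into whatever mask format an attention kernel expects.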