HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan; Qiulin Li; Yutao Cui; Miles Yang; Yuehai Wang; Qun Yang; Jin Zhou; Zhao Zhong

TL;DR

HunyuanVideo-Foley tackles text-video-to-audio (TV2A) generation with a scalable, end-to-end framework that fuses video, text, and audio through a dual-stream multimodal diffusion transformer. An automated data pipeline produces a 100k-hour TV2A dataset, and a Representation Alignment (REPA) loss guides diffusion training by aligning the model's internal representations with pre-trained audio features. The model uses dual-phase attention, interleaved RoPE, and synchronization-conditioned modulation to balance visual and textual semantics while improving temporal coherence, and it relies on a DAC-VAE-based audio encoder/decoder. On Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench, HunyuanVideo-Foley achieves new state-of-the-art results in audio fidelity, visual-semantic alignment, and temporal alignment, supporting the effectiveness of REPA and large-scale data for Foley-quality TV2A generation.

Abstract

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance, and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline that curates 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy that uses self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer that resolves modal competition by combining dual-stream audio-video fusion through joint attention with textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
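
The representation alignment idea in item (2) can be illustrated with a small sketch. It is not the authors' implementation: the function names, the linear projection, and the choice of "one minus mean cosine similarity" as the loss are all assumptions made for illustration. The sketch projects each intermediate hidden state of a (hypothetical) diffusion model into the feature space of a frozen audio encoder and penalizes low cosine similarity with the encoder's features for the same frame.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero

def project(hidden, weight):
    """Apply a linear map; `weight` is a list of rows (out_dim x in_dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def repa_alignment_loss(hidden_states, target_features, weight):
    """Hypothetical alignment loss: 1 - mean cosine similarity per frame.

    hidden_states   : per-frame hidden vectors from the diffusion model
    target_features : per-frame vectors from a frozen, pre-trained audio encoder
    weight          : projection from hidden space to the encoder's space
    """
    sims = [
        cosine_similarity(project(h, weight), t)
        for h, t in zip(hidden_states, target_features)
    ]
    return 1.0 - sum(sims) / len(sims)

# Toy example: two frames, 2-dimensional features, identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
aligned_loss = repa_alignment_loss(hidden, hidden, identity)
opposed_loss = repa_alignment_loss(hidden, [[-1.0, 0.0], [0.0, -2.0]], identity)
```

In this toy setting the loss is close to 0 when the projected hidden states point in the same direction as the target features and close to 2 when they point in the opposite direction. In training, a term of this kind would presumably be added to the diffusion objective with a weighting coefficient; that coefficient and the layer from which hidden states are taken are not specified in the text above.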
