HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, Zhao Zhong

TL;DR

HunyuanVideo-Foley tackles the text-video-to-audio (TV2A) problem by building a scalable, end-to-end framework that fuses video, text, and audio through a dual-stream multimodal diffusion transformer. A large 100k-hour TV2A dataset is created via a comprehensive automated data pipeline, and a Representation Alignment (REPA) loss guides diffusion training by aligning internal representations with pre-trained audio features. The model employs dual-phase attention, interleaved RoPE, and synchronization-conditioned modulation to balance visual and textual semantics while improving temporal coherence, augmented by a DAC-VAE-based audio encoder/decoder. Across Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench, HunyuanVideo-Foley achieves new state-of-the-art results in audio fidelity, visual-semantic alignment, and temporal alignment, validating the effectiveness of REPA and large-scale data for Foley-quality TV2A generation.

Abstract

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: https://szczesnys.github.io/hunyuanvideo-foley/.
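The representation alignment idea in innovation (2) can be illustrated with a small sketch. This section does not give the paper's exact formulation (which layer is aligned, how features are projected, or how the term is weighted against the diffusion loss), so the code below is only an assumption-laden illustration: it treats the alignment term as the negative mean cosine similarity between per-frame hidden states of the diffusion transformer (assumed to be already projected to the target dimension) and per-frame features from a frozen self-supervised audio encoder. The function names `cosine_similarity` and `repa_loss` are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + eps)


def repa_loss(projected_hidden, target_features):
    """Hypothetical alignment term: negative mean per-frame cosine similarity.

    projected_hidden: list of frames, each a list of floats, taken from an
        intermediate transformer block and assumed to be projected to the
        same dimension as the target features.
    target_features: list of frames from a frozen pre-trained audio encoder.
    """
    if len(projected_hidden) != len(target_features):
        raise ValueError("both sequences must have the same number of frames")
    if not projected_hidden:
        raise ValueError("at least one frame is required")
    sims = [
        cosine_similarity(h, t)
        for h, t in zip(projected_hidden, target_features)
    ]
    # Minimizing this value pushes the hidden states toward the target features.
    return -sum(sims) / len(sims)


if __name__ == "__main__":
    hidden = [[1.0, 0.0], [0.0, 2.0]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    print(repa_loss(hidden, target))
```

In a training loop, a term of this kind would typically be added to the main diffusion objective with a scalar weight; that weight and the choice of aligned layer are not specified here and would need to be taken from the paper itself.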

Paper Structure

This paper contains 36 sections, 6 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Data pipeline for filtering video-audio data. The workflow illustrates the processing steps from the raw video database to the filtered video-audio database.
  • Figure 2: Overview of the HunyuanVideo-Foley model architecture. The proposed model integrates encoded text (CLAP), visual (SigLIP-2), and audio (DAC-VAE) inputs through a hybrid framework with $N_1$ multimodal transformer blocks followed by $N_2$ unimodal transformer blocks. The hybrid transformer blocks are modulated and gated with synchronization features and timestep embeddings. A pre-trained ATST-Frame model is used to compute the REPA loss with latent representations from a unimodal transformer block. The generated audio latents are decoded into audio waveforms by the DAC-VAE decoder.
  • Figure 3: Radar chart of video-to-audio evaluation. It shows results on three evaluation sets, Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench, demonstrating that HunyuanVideo-Foley achieves comprehensive superiority.
  • Figure 4: Left: The video sequence illustrates a walking scenario on icy surfaces, where our proposed method achieves precise temporal alignment for both the initiation/termination timing and the duration of each step. Right: Spectral analysis confirms accurate synchronization with the temporal characteristics of human movements in the skateboarding scenario.
  • Figure 5: Left: In the ice hockey scenario involving rapid rhythmic auditory cues, our spectral analysis demonstrates robust performance in detecting subtle motion variations synchronized with the sound patterns. Right: Our method preserves the full spectral representation in a complex skiing scenario where motion-sound alignment is less distinct, with no discernible degradation of high-frequency components in the spectrogram.