Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation

Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai

TL;DR

Kling-Foley introduces a scalable multimodal Video-to-Audio diffusion framework that jointly models video, audio, and text to generate high-fidelity, temporally synchronized sound effects and music. It combines visual semantic grounding, audio-visual synchronization, a universal latent audio codec, and a variable-duration diffusion mechanism to handle natural video lengths. A new Kling-Audio-Eval benchmark enables robust, multimodal evaluation across distribution, semantic, temporal, and audio-quality dimensions. Experimental results show state-of-the-art performance on key metrics, with practical implications for automated dubbing, game audio, and film production workflows.

Abstract

We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model that synthesizes high-quality audio synchronized with video content. Kling-Foley uses multimodal diffusion transformers to model the interactions between the video, audio, and text modalities, and combines them with a visual semantic representation module and an audio-visual synchronization module to improve alignment. Specifically, these modules align video conditions with latent audio elements at the frame level, thereby improving semantic alignment and audio-visual synchronization. Together with text conditions, this integrated approach enables precise generation of video-matching sound effects. In addition, we propose a universal latent audio codec that achieves high-quality modeling across scenarios such as sound effects, speech, singing, and music, and we employ a stereo rendering method that gives the synthesized audio spatial presence. To compensate for the incomplete category coverage and annotations of existing open-source benchmarks, we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our experiments show that Kling-Foley, trained with the flow matching objective, achieves new audio-visual SOTA performance among public models in terms of distribution matching, semantic alignment, temporal alignment, and audio quality.
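
The abstract names a flow matching objective without stating its form. As a point of reference only, a commonly used conditional flow matching loss with a linear interpolation path can be written as below. The symbols are assumptions and do not come from this summary: $x_1$ is a clean audio latent, $x_0$ is Gaussian noise, $c$ stands for the video and text conditions, and $v_\theta$ is the learned velocity field. The paper's exact parameterization may differ.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1], \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1,\,c}\,
  \bigl\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\rVert_2^2 .
\end{aligned}
```

Under this formulation, sampling integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ (noise) to $t = 1$ (audio latent).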

Paper Structure

This paper contains 31 sections, 9 equations, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation model. Taking an input video and an optional text prompt, the model synthesizes high-fidelity audio that is semantically aligned and temporally synchronized with the video content, encompassing elements such as sound effects and background music. Significantly, Kling-Foley can produce audio sequences of arbitrary duration, dynamically adapting to the length of the input video.
  • Figure 2: The core of Kling-Foley is a multimodal-controlled flow-matching model. Text, video, and temporally extracted video frames serve as conditional inputs. The multimodal features are fused by a Multimodal Joint Conditioning module, whose output feeds into the MMDiT block for processing. The MMDiT block predicts VAE latents, which a pretrained mel decoder then reconstructs into a monaural mel-spectrogram. The monaural spectrogram is converted to a stereo spectrogram by a Mono2Stereo module. Finally, the stereo spectrogram is passed through a vocoder to generate the output waveform.
  • Figure 3: The main body of the latent audio codec is a Mel-VAE, which jointly trains a mel encoder, a mel decoder, and a discriminator. The VAE structure enables the model to learn a continuous and complete latent-space distribution, significantly enhancing its audio representation capabilities.
  • Figure 4: Audio and video data undergo preprocessing and quality filtering to obtain high-quality single-event audio and video segments. Subsequently, synthetic multi-event audio samples are generated through temporal augmentation, and large models are used to generate and extract keywords and classification captions for audio and video. Finally, various caption information is combined to produce the final training captions.
  • Figure 5: Category distribution of sound events in the training set. The broad coverage of real-world acoustic events ensures the diversity and generalizability required for training open-domain sound generation models.
  • ...and 1 more figure
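
The Figure 3 caption credits the VAE structure with giving the codec a continuous latent space. A minimal sketch of the two standard VAE ingredients behind that property follows: reparameterized sampling and the KL penalty toward a standard normal prior. It assumes a diagonal Gaussian posterior, which this summary does not confirm. The function names are hypothetical, and the discriminator and reconstruction terms of the actual Mel-VAE are not modeled.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1).

    Assumes a diagonal Gaussian posterior (hypothetical; not stated in the summary).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))


rng = random.Random(0)

# A posterior equal to the prior incurs no KL penalty.
mu = [0.0] * 8
log_var = [0.0] * 8
z = sample_latent(mu, log_var, rng)
kl_prior = kl_to_standard_normal(mu, log_var)

# Shifting every mean by 1 costs 0.5 per dimension.
kl_shifted = kl_to_standard_normal([1.0] * 8, log_var)
```

The KL term pulls each encoded frame toward the same prior, which is what keeps nearby latents decodable and makes the space suitable as a target for a downstream generative model.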