AudioX: Diffusion Transformer for Anything-to-Audio Generation

Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

TL;DR

AudioX is a unified Diffusion Transformer for anything-to-audio and music generation that accepts text, video, image, music, and audio inputs. It combines modality-specific encoders with a latent diffusion model conditioned on a multi-modal embedding, and it masks inputs across modalities during training so that the model learns robust cross-modal representations and generates audio aligned with whichever inputs are present. To support training and evaluation, the authors curate two large captioned datasets, vggsound-caps and V2M-caps. They report results that match or outperform specialized state-of-the-art models on text-to-audio, video-to-audio, text-and-video-to-audio, text-to-music, video-to-music, and text-and-video-to-music tasks, and the model also supports audio inpainting and music completion. Code and datasets are to be released for reproducibility and further research.

Abstract

Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/
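The masked training idea described above can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: this summary does not say whether masking zeroes tokens or substitutes a learned mask embedding, nor at what granularity it is applied. The function names `mask_tokens` and `build_condition`, the zero-fill choice, and the per-modality ratios are all hypothetical.

```python
import numpy as np


def mask_tokens(tokens, mask_ratio, rng):
    """Zero out a random fraction of token rows (assumed masking scheme).

    tokens: array of shape (num_tokens, dim) from one modality encoder.
    Returns the masked copy and a boolean mask (True = masked row).
    """
    num_tokens = tokens.shape[0]
    num_masked = int(round(mask_ratio * num_tokens))
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[masked_idx] = True
    masked = tokens.copy()
    masked[mask] = 0.0  # assumption: zero-fill rather than a learned mask token
    return masked, mask


def build_condition(modalities, mask_ratios, rng):
    """Mask each modality independently, then concatenate along the token axis.

    modalities: dict mapping a modality name to a (num_tokens, dim) array.
    mask_ratios: dict mapping a modality name to its mask ratio (default 0).
    """
    masked_parts = []
    for name, tokens in modalities.items():
        masked, _ = mask_tokens(tokens, mask_ratios.get(name, 0.0), rng)
        masked_parts.append(masked)
    return np.concatenate(masked_parts, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder embeddings standing in for real encoder outputs.
    inputs = {"text": np.ones((5, 4)), "video": np.ones((8, 4))}
    condition = build_condition(inputs, {"text": 0.2, "video": 0.6}, rng)
    print(condition.shape)
```

In a real training loop, the concatenated condition would be passed to the diffusion model at each step, so the model has to generate audio even when part of each input is hidden.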

Paper Structure

This paper contains 21 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: (a) Overview of AudioX, illustrating its capabilities across various tasks. (b) Radar chart comparing the performance of different methods across multiple benchmarks. AudioX demonstrates superior Inception Scores (IS) across a diverse set of datasets in audio and music generation tasks.
  • Figure 2: Overview of the automated caption generation pipeline. For each video-audio clip (top), Qwen2-Audio uses dataset-provided keywords to produce an audio caption. For each video-music pair (bottom), it describes key attributes (e.g. genre, instruments, mood, tempo) to form a music caption.
  • Figure 3: The AudioX Framework. This figure depicts the AudioX framework, which employs specialized encoders and a DiT-based approach with input masking to generate high-quality audio, unifying diverse input modalities for comprehensive audio and music creation.
  • Figure 4: User study results of generated audio and music. The values represent the average OVL and REL scores across Text-to-Audio (on AudioCaps), Text-to-Music (on MusicCaps), Video-to-Audio (on VGGSound), Video-to-Music (on V2M-bench).
  • Figure 5: Ablation study of mask ratios for each modality, with mask ratios of 0.2, 0.4, 0.6, and 0.8. The values represent the average Inception Score (IS) across Text-to-Audio, Text-to-Music, Video-to-Audio, Video-to-Music, and Audio Inpainting tasks.
  • ...and 3 more figures
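
Several captions above report the Inception Score (IS). As a reminder of what that metric measures, here is a minimal sketch of the standard definition, IS = exp(E_x[KL(p(y|x) || p(y))]), computed from per-sample class probabilities. It assumes those probabilities come from some pretrained classifier applied to the generated audio. Which classifier the paper uses is not stated in this summary, and the function name `inception_score` is illustrative.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities.

    probs: array of shape (num_samples, num_classes); each row is p(y|x)
    for one generated sample and should sum to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution p(y), averaged over all samples.
    marginal = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for each sample; eps guards against log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


if __name__ == "__main__":
    # Confident and diverse predictions give a score near the class count.
    print(inception_score(np.eye(3)))
    # Uninformative predictions give the minimum score of 1.
    print(inception_score(np.full((4, 3), 1.0 / 3.0)))
```

A higher score means the classifier is confident on individual samples while the set of samples covers many classes, which is why IS is used as a joint proxy for quality and diversity.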