JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models

Peike Li, Boyu Chen, Yao Yao, Yikai Wang, Allen Wang, Alex Wang

TL;DR

JEN-1 tackles text-to-music generation with a universal diffusion framework that directly models 48kHz waveforms. It introduces an omnidirectional latent diffusion model operating on latent representations from a masked autoencoder, enabling text-guided generation, inpainting, and continuation within a single non-cascaded model. Through multi-task and in-context training, it achieves superior text-music alignment and audio quality while maintaining efficiency, outperforming state-of-the-art baselines on MusicCaps with strong human judgments. This work advances controllable, high-fidelity music generation and opens pathways for zero-shot creative applications.

Abstract

Music generation has attracted growing interest with the advancement of deep generative models. However, generating music conditioned on textual descriptions, known as text-to-music, remains challenging due to the complexity of musical structures and high sampling rate requirements. Despite the task's significance, prevailing generative models exhibit limitations in music quality, computational efficiency, and generalization. This paper introduces JEN-1, a universal high-fidelity model for text-to-music generation. JEN-1 is a diffusion model incorporating both autoregressive and non-autoregressive training. Through in-context learning, JEN-1 performs various generation tasks including text-guided music generation, music inpainting, and continuation. Evaluations demonstrate JEN-1's superior performance over state-of-the-art methods in text-music alignment and music quality while maintaining computational efficiency. Our demos are available at https://jenmusic.ai/audio-demos

Paper Structure

This paper contains 13 sections, 5 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Illustration of the JEN-1 multi-task training strategy, covering the text-guided music generation, music inpainting, and music continuation tasks. JEN-1 generalizes across tasks through in-context learning by concatenating the noise and the masked audio channel-wise. It integrates a bidirectional mode to gather comprehensive context and a unidirectional mode to capture sequential dependencies.
  • Figure 2: Illustration of the bidirectional and unidirectional modes for the convolutional block and the transformer block. In the unidirectional mode, the convolutional block uses causal padding, and the self-attention mask in the transformer block restricts attention to the left context.
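
The two captions describe mechanisms that are easier to see in a toy example. The sketch below is not the JEN-1 implementation: it uses plain Python lists in place of tensors, and the names build_task_input, task_mask, conv1d, and attention_mask, as well as the specific mask layouts per task, are assumptions made for illustration. It shows (1) stacking the noisy input and the masked audio along the channel axis, and (2) how causal padding and a left-only attention mask turn a bidirectional block into a unidirectional one.

```python
from typing import List

Channels = List[List[float]]  # layout: [channel][time]


def task_mask(task: str, length: int, start: int = 0, end: int = 0) -> List[int]:
    """Hypothetical keep-masks (1 = given as context, 0 = to be generated)."""
    if task == "generation":    # no audio context at all
        return [0] * length
    if task == "inpainting":    # hide the span [start, end)
        return [0 if start <= t < end else 1 for t in range(length)]
    if task == "continuation":  # keep only the prefix [0, start)
        return [1 if t < start else 0 for t in range(length)]
    raise ValueError(f"unknown task: {task}")


def build_task_input(noisy: Channels, audio: Channels, keep: List[int]) -> Channels:
    """Concatenate the noisy input and the masked audio channel-wise."""
    masked = [[x * k for x, k in zip(channel, keep)] for channel in audio]
    return noisy + masked  # list concatenation stacks the channels


def conv1d(x: List[float], kernel: List[float], causal: bool) -> List[float]:
    """Single-channel 1-D convolution whose output has the same length as x.

    Causal mode puts all zero padding on the left, so output[t] depends only
    on x[0..t]. Bidirectional mode pads both sides and also sees the future.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(x))]


def attention_mask(length: int, causal: bool) -> List[List[bool]]:
    """mask[q][k] is True where query position q may attend to key position k."""
    return [[(k <= q) or not causal for k in range(length)]
            for q in range(length)]


if __name__ == "__main__":
    keep = task_mask("inpainting", length=6, start=2, end=4)
    print(keep)
    print(conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0], causal=True))
    print(attention_mask(3, causal=True))
```

In this sketch, switching between the two modes changes only where the padding goes and which entries of the attention mask are allowed; the kernel weights and the input are identical in both cases. That is consistent with the captions' description of one model running in either mode, though how JEN-1 shares parameters across modes is not stated in the text above.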