FLUX that Plays Music

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang

TL;DR

FluxMusic tackles text-to-music generation with a scalable diffusion-Transformer framework that applies rectified flow in a latent mel-spectrogram space. The model first processes text and music jointly in a double-stream stage conditioned on multiple pre-trained text encoders, then transitions to a single-stream music-focused stage, and is trained with rectified flow for improved efficiency and quality. Experiments show competitive or superior objective metrics and strong human preference results relative to established diffusion methods, with ablations supporting the architecture and training choices. The experimental data, code, and model weights are publicly released to facilitate further research.

Abstract

This paper explores a simple extension of diffusion-based rectified flow Transformers for text-to-music generation, termed FluxMusic. Following the design of the advanced Flux model (https://github.com/black-forest-labs/flux), we transfer it into a latent VAE space of mel-spectrum. The model first applies a sequence of independent attention to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to sufficiently capture caption semantic information and to provide inference flexibility. Coarse textual information, in conjunction with time step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods for the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are made publicly available at: https://github.com/feizc/FluxMusic.
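
The rectified flow training mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the common formulation in which a clean latent and Gaussian noise are joined by a straight line and the network regresses the constant velocity along that line; the exact parameterization and time sampling used by FluxMusic are not stated here, so the function names (`rectified_flow_pair`, `rectified_flow_loss`), the uniform time sampling, and the tensor shapes are all illustrative assumptions.

```python
import numpy as np

def rectified_flow_pair(x0, noise, t):
    """Return the interpolated sample x_t and its velocity target.

    x0:    clean latent patches, shape (batch, tokens, dim)
    noise: Gaussian noise with the same shape as x0
    t:     per-example time in [0, 1], shape (batch,)
    """
    t = t.reshape(-1, 1, 1)            # broadcast over tokens and channels
    x_t = (1.0 - t) * x0 + t * noise   # straight line from data (t=0) to noise (t=1)
    velocity = noise - x0              # derivative of x_t with respect to t
    return x_t, velocity

def rectified_flow_loss(predict_velocity, x0, rng):
    """Mean squared error between predicted and target velocity (assumed objective)."""
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(0.0, 1.0, size=x0.shape[0])  # assumption: uniform time sampling
    x_t, target = rectified_flow_pair(x0, noise, t)
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))

# Hypothetical stand-in for the Transformer: always predicts zero velocity.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16, 8))
loss = rectified_flow_loss(lambda x_t, t: np.zeros_like(x_t), x0, rng)
```

In a real training loop, `predict_velocity` would be the text-conditioned Transformer, and sampling would integrate the predicted velocity from noise back toward data.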

Paper Structure

This paper contains 25 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Model architecture of FluxMusic. We use frozen CLAP-L and T5-XXL as text encoders to extract conditioning features from the caption. The coarse text information, concatenated with the timestep embedding $y$, is used in the modulation mechanism. The fine-grained text $c$, concatenated with the music sequence $x$, is fed into a stack of double-stream blocks and single-stream blocks to predict noise in a latent VAE space.
  • Figure 2: Loss curves of different model structures with similar parameter counts. Combining double-stream and single-stream blocks is a much more scalable and compute-efficient design for a music generation model.
  • Figure 3: Loss curves of models with different parameter counts and the same structure. Increasing the number of model parameters consistently improves generative performance.
  • Figure 4: Generated mel-spectrum cases at different training steps. We plot samples from the small version of FluxMusic every 50K training steps; the image becomes orderly and fine-grained, rather than random and disorderly, as training continues.
  • Figure 5: Generated mel-spectrum cases for different model sizes. As model size increases, the resulting mel-spectrum becomes richer in content and more rhythmically distinct.
  • ...and 1 more figure
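
The two conditioning paths described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative NumPy sketch, not the released implementation: it assumes an adaLN-style scale-and-shift modulation, and the names `modulate`, `build_inputs`, `w_scale`, and `w_shift`, as well as all dimensions, are hypothetical.

```python
import numpy as np

def build_inputs(text_tokens, music_tokens):
    """Concatenate fine-grained text tokens and music patch tokens along the sequence axis."""
    return np.concatenate([text_tokens, music_tokens], axis=1)

def modulate(tokens, cond, w_scale, w_shift):
    """Scale and shift every token using a conditioning vector.

    tokens: (batch, length, dim)
    cond:   (batch, cond_dim), coarse text feature joined with the timestep embedding
    w_*:    (cond_dim, dim) projection matrices (assumed learnable in a real model)
    """
    scale = cond @ w_scale   # (batch, dim)
    shift = cond @ w_shift   # (batch, dim)
    return tokens * (1.0 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
batch, dim = 2, 8
pooled_text = rng.standard_normal((batch, 4))        # coarse caption feature (assumed pooled)
t_emb = rng.standard_normal((batch, 4))              # timestep embedding
text_tokens = rng.standard_normal((batch, 5, dim))   # fine-grained text tokens
music_tokens = rng.standard_normal((batch, 12, dim)) # latent music patches

cond = np.concatenate([pooled_text, t_emb], axis=-1)  # coarse path, as in the caption
seq = build_inputs(text_tokens, music_tokens)         # fine-grained path

# Zero-initialised projections make the modulation an identity map at the start.
w_scale = np.zeros((cond.shape[-1], dim))
w_shift = np.zeros((cond.shape[-1], dim))
out = modulate(seq, cond, w_scale, w_shift)
```

Keeping the coarse signal in the modulation path and the fine-grained tokens in the attention sequence means the caption can influence every block globally while individual words remain available for token-level attention.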