FLUX that Plays Music
Zhengcong Fei, Mingyuan Fan, Changqian Yu, Junshi Huang
TL;DR
FluxMusic tackles text-to-music generation with a scalable diffusion Transformer trained with rectified flow in the latent VAE space of mel-spectrograms. The architecture first processes text and music jointly in a double-stream stage that draws on multiple pre-trained text encoders, then switches to a single music-only stream. Experiments on large datasets show competitive or superior automatic metrics and favorable human preference results relative to established diffusion methods, and ablations support the architecture and training choices. The code, model weights, and experimental data are publicly released to facilitate further research.
Abstract
This paper explores a simple extension of diffusion-based rectified flow Transformers to text-to-music generation, termed FluxMusic. Following the design of the advanced Flux\footnote{https://github.com/black-forest-labs/flux} model, we transfer it into a latent VAE space of the mel-spectrum. The model first applies a sequence of independent attention operations to the double text-music stream, followed by a stacked single music stream for denoised patch prediction. We employ multiple pre-trained text encoders to capture the semantic information of captions sufficiently and to allow flexibility at inference. Coarse textual information, in conjunction with time step embeddings, is used in a modulation mechanism, while fine-grained textual details are concatenated with the music patch sequence as inputs. Through an in-depth study, we demonstrate that rectified flow training with an optimized architecture significantly outperforms established diffusion methods on the text-to-music task, as evidenced by various automatic metrics and human preference evaluations. Our experimental data, code, and model weights are publicly available at: \url{https://github.com/feizc/FluxMusic}.
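
To make the rectified flow objective mentioned above concrete, the following is a minimal sketch of one training step in its standard textbook form: a clean latent and a Gaussian noise sample are joined by a straight line, and the network is asked to predict the constant velocity along that line. This is not taken from the FluxMusic code. The function names (`rectified_flow_pair`, `training_loss`), the `model(x_t, t, cond)` call signature, the uniform sampling of the time step, and the use of flat Python lists in place of latent mel-spectrum patches are all assumptions made for illustration.

```python
import random


def rectified_flow_pair(x0, noise, t):
    """Return the point x_t on the straight path and the velocity target.

    x_t    = (1 - t) * x0 + t * noise   (t = 0 is data, t = 1 is noise)
    target = noise - x0                 (constant velocity along the path)
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    target = [b - a for a, b in zip(x0, noise)]
    return x_t, target


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def training_loss(model, x0, cond, rng):
    """One rectified-flow loss evaluation for a single latent x0.

    `model` is a hypothetical callable model(x_t, t, cond) that returns a
    predicted velocity with the same length as x_t; `cond` stands in for
    the text conditioning. The time step is drawn uniformly from [0, 1),
    which is an assumption of this sketch.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    t = rng.random()
    x_t, target = rectified_flow_pair(x0, noise, t)
    return mse(model(x_t, t, cond), target)


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder "network" that always predicts zero velocity.
    zero_model = lambda x_t, t, cond: [0.0 for _ in x_t]
    print(training_loss(zero_model, [0.5, -0.2, 1.0], cond=None, rng=rng))
```

In a real system the lists would be tensors of latent patches and `model` would be the Transformer described above, conditioned on the text encoder outputs; the interpolation and the velocity target would keep the same form.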
