MusFlow: Multimodal Music Generation via Conditional Flow Matching

Jiahao Song, Yuzhao Wang

TL;DR

MusFlow tackles the challenge of multimodal music generation by mapping diverse conditions (images, stories, captions) into the CLAP audio space via lightweight MLP adapters and generating audio with Conditional Flow Matching in a pretrained VAE latent space, enabling fast, end-to-end synthesis. A Multi-Agent Workflow constructs MMusSet, a 33.3k-sample dataset containing image/story/caption/music quadruples to support training and evaluation without heavy language-model pipelines. Empirical results across caption-, story-, image-, and multimodal-to-music tasks show MusFlow achieving state-of-the-art or competitive audio quality (low FAD/KL, strong CLAP/IB alignment) and high subjective quality, while ablations highlight the value of the Alignment Module and random conditioning. The work advances practical multimodal music generation and provides the dataset, code, and methodology to facilitate broader adoption and further research in multimedia sound design.

Abstract

Music generation aims to create music segments that align with human aesthetics based on diverse conditional information. Despite advances in generating music from specific textual descriptions (e.g., style, genre, instruments), practical application is still hindered because ordinary users often lack the expertise or time to write accurate prompts. To bridge this application gap, this paper introduces MusFlow, a novel multimodal music generation model using Conditional Flow Matching. We employ multiple Multi-Layer Perceptrons (MLPs) to align multimodal conditional information into the audio's CLAP embedding space. Conditional flow matching is trained to reconstruct the compressed Mel-spectrogram in the pretrained VAE latent space, guided by the aligned feature embedding. MusFlow can generate music from images, story texts, and music captions. To collect data for model training, inspired by multi-agent collaboration, we construct an intelligent data annotation workflow centered around a fine-tuned Qwen2-VL model. Using this workflow, we build a new multimodal music dataset, MMusSet, in which each sample is a quadruple of image, story text, music caption, and music piece. We conduct four sets of experiments: image-to-music, story-to-music, caption-to-music, and multimodal music generation. Experimental results demonstrate that MusFlow can generate high-quality music pieces whether the input conditions are unimodal or multimodal. We hope this work can advance the application of music generation in the multimedia field, making music creation more accessible. Our generated samples, code, and dataset are available at musflow.github.io.
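
To make the two components named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a two-layer MLP adapter, a straight-line probability path between Gaussian noise and the VAE latent (one common choice for conditional flow matching; the abstract does not specify the path), and a toy linear velocity model. All function names, dimensions, and weights (`mlp_adapter`, `cfm_training_pair`, `toy_velocity`, `cfm_loss`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def mlp_adapter(x, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping a condition feature
    (e.g. an image feature) into a CLAP-sized embedding."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    return h @ w2 + b2

def cfm_training_pair(z1, rng):
    """Sample (t, z_t, target velocity) on an assumed straight-line path
    from noise z0 to the data latent z1."""
    z0 = rng.standard_normal(z1.shape)       # noise endpoint
    t = rng.uniform(size=(z1.shape[0], 1))   # one time value per example
    zt = (1.0 - t) * z0 + t * z1             # point on the path at time t
    return t, zt, z1 - z0                    # velocity is constant along the path

def toy_velocity(zt, t, cond, w):
    """Stand-in for the velocity network: a single linear map over
    the concatenation [z_t, t, condition]."""
    return np.concatenate([zt, t, cond], axis=1) @ w

def cfm_loss(pred, target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
batch, feat_dim, hidden, clap_dim, latent_dim = 4, 16, 32, 12, 8  # toy sizes

# Align a (random, placeholder) image feature into the toy "CLAP" space.
image_feat = rng.standard_normal((batch, feat_dim))
w1 = rng.standard_normal((feat_dim, hidden)) * 0.1
b1 = np.zeros(hidden)
w2 = rng.standard_normal((hidden, clap_dim)) * 0.1
b2 = np.zeros(clap_dim)
cond = mlp_adapter(image_feat, w1, b1, w2, b2)

# One flow-matching training step's loss on placeholder VAE latents.
z1 = rng.standard_normal((batch, latent_dim))
t, zt, target = cfm_training_pair(z1, rng)
w = rng.standard_normal((latent_dim + 1 + clap_dim, latent_dim)) * 0.1
loss = cfm_loss(toy_velocity(zt, t, cond, w), target)
print(loss)
```

In a real system the velocity model would be a neural network trained by minimizing this loss, and sampling would integrate the learned velocity field from noise to a latent that the VAE decoder turns back into a Mel-spectrogram.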

Paper Structure

This paper contains 25 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Multimodal music generation by our proposed MusFlow model.
  • Figure 2: An illustration of our MusFlow framework for multimodal music generation.
  • Figure 3: An illustration of our proposed Multi-Agent Workflow for dataset creation.
  • Figure 4: Ablation study results for the Alignment Module. CLAP$_{music}$ refers to using the target audio's CLAP embedding as conditioning during training; CLAP$_{caption}$ and CLIP$_{story}$ denote directly using the original features as conditioning; Avg denotes using the average of the original features as conditioning in the multimodal task; Aligned refers to the proposed method, in which the Alignment Module produces the conditional embedding.