MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Sanjoy Chowdhury; Sayan Nag; K J Joseph; Balaji Vasan Srinivasan; Dinesh Manocha

TL;DR

MeLFusion tackles multi-modal music synthesis by conditioning diffusion-based generation on both an input image and a text description. It introduces a "visual synapse" that injects image semantics into a text-to-music diffusion model by mixing features from a frozen text-to-image diffusion model into its cross-attention layers, with learned per-layer weights. The authors introduce the MeLBench dataset (11,250 triplets) and the IMSM evaluation metric for image-music alignment, and report relative gains of up to 67.98% in FAD over strong baselines on two datasets. The work suggests that direct image conditioning substantially improves musical coherence and emotional alignment for social-media and multimedia workflows.

Abstract

Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will draw attention to this pragmatic, yet relatively under-explored research area.

Paper Structure (33 sections, 7 equations, 11 figures, 16 tables, 2 algorithms)

Introduction
Related Works
Synthesizing Music from Image and Text
Extracting Visual Guidance
Text-to-Music LDM with Visual Synapse
Overall Framework
Experiments and Results
Datasets
Evaluation Metrics
Baseline Methods
Results
Discussions and Analysis
Conclusion and Future Works
More Details on TANGO++
Problem Motivation Revisited
...and 18 more sections

Figures (11)

Figure 1: We present MeLFusion, a music diffusion model equipped with a novel "visual synapse" that can effectively infuse image semantics into a text-to-music diffusion model. This task requires a detailed understanding of the concepts in the image. An alternative approach, such as using a caption generator to convert the image into text that is then fed to existing text-to-music methods, leads to a sub-optimal overall audio quality (OVL) score. Our approach can knit together complementary information from both modalities to synthesize high-quality music.
Figure 2: Our approach MeLFusion generates a music waveform $\bm{w}$ conditioned on an image $\bm{I}$ and a given textual instruction $\bm{Y}$. Visual semantics from $\bm{I}$ are instilled into a text-to-music diffusion model (bottom green box) using a pre-trained and frozen text-to-image diffusion model (top blue box). The image $\bm{I}$ is first DDIM inverted into a noisy latent $\bm{z}^I_T$. The self-attention features from the decoder layers of the text-to-image LDM that consumes $\bm{z}^I_T$ are infused into the cross-attention features of the text-to-music LDM decoder layers, modulated by learned $\alpha$ parameters. This fusion operation, which happens in the decoder (green stripes), is detailed on the right side of the figure. The music encoder projects the spectrogram representation of the music to the latent space, and the music decoder recovers the spectrograms. Finally, a vocoder generates the waveform $\bm{w}$ from the spectrograms. Please refer to the method section for more details.
Figure 3: The distribution of different genres in MeLBench.
Figure 4: Some image and text pairs from MeLBench. We include more examples in the Appendix.
Figure 5: A mock-up of a social media post that contains an image and associated textual content. Our approach, MeLFusion, can consume such image-text pairs as input and synthesize music that goes well with them.
...and 6 more figures
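
The Figure 2 caption describes image self-attention features being infused into the music model's cross-attention features, modulated by learned $\alpha$ parameters, but this page does not give the exact mixing rule. The following is a minimal sketch, assuming a simple per-layer convex combination with a scalar $\alpha$ in [0, 1] and assuming both feature maps have already been brought to the same shape. The function names `fuse_attention` and `fuse_decoder_layers` are hypothetical and are not taken from the paper or its code.

```python
from typing import List

Matrix = List[List[float]]


def fuse_attention(music_cross_attn: Matrix, image_self_attn: Matrix, alpha: float) -> Matrix:
    """Blend two same-shaped feature maps with one scalar weight.

    ASSUMPTION: a convex combination (1 - alpha) * music + alpha * image.
    The paper's actual fusion rule may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(music_cross_attn) != len(image_self_attn):
        raise ValueError("feature maps must have the same number of rows")
    fused: Matrix = []
    for music_row, image_row in zip(music_cross_attn, image_self_attn):
        if len(music_row) != len(image_row):
            raise ValueError("feature maps must have the same number of columns")
        fused.append(
            [(1.0 - alpha) * m + alpha * i for m, i in zip(music_row, image_row)]
        )
    return fused


def fuse_decoder_layers(
    music_layers: List[Matrix], image_layers: List[Matrix], alphas: List[float]
) -> List[Matrix]:
    """Apply the blend independently per decoder layer, one alpha per layer."""
    if not (len(music_layers) == len(image_layers) == len(alphas)):
        raise ValueError("need one image feature map and one alpha per music layer")
    return [
        fuse_attention(music, image, alpha)
        for music, image, alpha in zip(music_layers, image_layers, alphas)
    ]
```

In this sketch, an $\alpha$ of 0 leaves a layer's text-to-music features untouched and an $\alpha$ of 1 replaces them with the image features, so values learned per layer would control how much visual guidance each decoder layer receives. In a trained model the $\alpha$ values would be learned parameters; here they are plain floats for illustration.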
