MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models

Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan

TL;DR

MuMu-LLaMA addresses the challenge of multi-modal music understanding and generation by fusing pre-trained music, image, and video encoders with a large language model via modality-aware adapters and a dedicated audio-token conditioning pathway. It introduces a 167.69-hour multi-modal dataset and a three-stage, LoRA-based training regime to align LLM representations with the music generation decoders (MusicGen and AudioLDM 2). Across music understanding, text-to-music generation, prompt-based music editing, and image/video-to-music generation, MuMu-LLaMA achieves state-of-the-art performance on objective metrics and strong subjective preference in user studies. The work demonstrates a scalable, data-centric approach to joint music understanding and generation, with potential impact on video production, interactive media, and creative AI tools.

Abstract

Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks--music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.

Paper Structure

This paper contains 38 sections, 13 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Multi-modal music understanding and generation by our proposed MuMu-LLaMA framework.
  • Figure 2: Multi-modal Music Understanding and Generation Model (MuMu-LLaMA). This model framework includes four core components: (1) Pre-trained feature encoders that process inputs from diverse modalities including music, images, and videos. (2) Understanding adapters that integrate these features into a coherent representation suitable for the LLaMA model. (3) The LLaMA model, which contextualizes and interprets the integrated information. (4) An output projection layer that translates the contextual understanding into outputs for the music generation decoder.
  • Figure 3: Distribution of instrument categories in our four curated datasets: (a) MUCaps reveals a broad diversity of instruments with a long-tail distribution. (b) MUEdit - A/D/R shows a relatively even distribution of add, delete, and replace manipulations across various instruments. (c) MUEdit - Speed & Pitch demonstrates a consistent distribution of speed and pitch modifications, suggesting balanced attention to tempo and tonal adjustments. (d) MUImage & MUVideo illustrates a balanced pairing of instruments with corresponding images and videos, ensuring a wide representation within these multi-modal components.
  • Figure 4: Music Oriented Dataset. Examples from the MUCaps, MUEdit, MUImage and MUVideo datasets used to train the MuMu-LLaMA model.
  • Figure 5: Training Stage 1: The Multi-modal Understanding Adapters are trained to integrate multi-modal features into the different layers of the LLaMA model.
  • ...and 9 more figures
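
The four-component data flow described in the Figure 2 caption can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `ToyPipeline`, the helpers `linear` and `apply`, all dimensions, and the random weights are hypothetical. The pre-trained encoders (component 1) are assumed to have already produced one feature vector per modality, the LLaMA model is replaced by a single linear map, and the adapters fuse modalities by simple summation, whereas the Figure 5 caption indicates that the actual adapters feed features into different layers of LLaMA.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random in_dim x out_dim weight matrix (stand-in for a learned layer)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def apply(weights, vec):
    """Multiply a vector of length in_dim by an in_dim x out_dim matrix."""
    out_dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vec, weights)) for j in range(out_dim)]


class ToyPipeline:
    """Hypothetical sketch of the adapter -> LLM -> output-projection flow."""

    def __init__(self, feature_dims, llm_dim=8, decoder_dim=4, seed=0):
        rng = random.Random(seed)
        self.llm_dim = llm_dim
        # (2) One understanding adapter per modality, mapping encoder
        #     features into the LLM's hidden size.
        self.adapters = {m: linear(d, llm_dim, rng) for m, d in feature_dims.items()}
        # (3) Stand-in for the LLM: a single linear map (assumption).
        self.llm = linear(llm_dim, llm_dim, rng)
        # (4) Output projection into the conditioning space of the
        #     music generation decoder.
        self.out_proj = linear(llm_dim, decoder_dim, rng)

    def forward(self, features):
        # (1) `features` holds pre-extracted encoder outputs per modality.
        fused = [0.0] * self.llm_dim
        for modality, vec in features.items():
            adapted = apply(self.adapters[modality], vec)
            # Summation is an illustrative fusion choice, not the paper's.
            fused = [a + b for a, b in zip(fused, adapted)]
        hidden = apply(self.llm, fused)
        return apply(self.out_proj, hidden)


pipeline = ToyPipeline({"music": 6, "image": 5})
conditioning = pipeline.forward({"music": [0.1] * 6, "image": [0.2] * 5})
print(len(conditioning))
```

The returned vector plays the role of the conditioning signal that a decoder such as MusicGen or AudioLDM 2 would consume; in this sketch it is only a list of floats with the assumed decoder dimension.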