MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models
Shansong Liu, Atin Sakkeer Hussain, Qilong Wu, Chenshuo Sun, Ying Shan
TL;DR
MuMu-LLaMA addresses the challenge of multi-modal music understanding and generation by fusing pre-trained music, image, and video encoders with a large language model via modality-aware adapters and a dedicated audio-token conditioning pathway. It introduces a 167.69-hour multi-modal dataset and a three-stage, LoRA-based training regime to align LLM representations with music generation decoders (MusicGen and AudioLDM 2). Across music understanding, text-to-music generation, prompt-based music editing, and image/video-to-music generation, MuMu-LLaMA reports state-of-the-art results on objective metrics and strong subjective preference in user studies. The work demonstrates a scalable, data-centric approach to joint music understanding and generation, with potential applications in video production, interactive media, and creative AI tools.
Abstract
Research on large language models has advanced significantly across text, speech, images, and videos. However, multi-modal music understanding and generation remain underexplored due to the lack of well-annotated datasets. To address this, we introduce a dataset with 167.69 hours of multi-modal data, including text, images, videos, and music annotations. Based on this dataset, we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music, images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen. Our evaluation across four tasks (music understanding, text-to-music generation, prompt-based music editing, and multi-modal music generation) demonstrates that MuMu-LLaMA outperforms state-of-the-art models, showing its potential for multi-modal music applications.
