MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Zihao Deng; Yinghao Ma; Yudong Liu; Rongchen Guo; Ge Zhang; Wenhu Chen; Wenhao Huang; Emmanouil Benetos

TL;DR

MusiLingo bridges music and natural language for captioning and question answering by connecting a frozen music encoder (MERT) to a frozen LLM (Vicuna) through a single trainable adapter with temporal compression. The MERT embedding $M \in \mathbb{R}^{B \times T \times D}$ is projected into the LLM's text space and compressed along the time axis to $T' = \lceil T / t \rceil$, enabling efficient cross-modal fusion while preserving the knowledge stored in both backbones. The model is pretrained on the LP-MusicCaps-MSD dataset and fine-tuned on the MusicInstruct dataset to support diverse music queries, including long-form, open-ended questions. Empirical results show competitive music QA performance and near state-of-the-art results on several music captioning metrics. Ablations show that the choice of fine-tuning data affects downstream performance, and the results suggest practical potential for music discovery and conversational AI in MIR.
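The shape bookkeeping in the summary above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. Reading $B$, $T$, $D$ as batch size, frame count and feature dimension, and $t$ as a compression factor, is an assumption, as are the use of average pooling for the temporal compression, the output width `D_LLM`, and the helper names `compress_time`, `project` and `adapt`. With average pooling and an affine projection, the order of the two steps does not change the result, so the sketch pools first to do less work.

```python
import math
import random


def compress_time(frames, t):
    """Average-pool consecutive groups of t frames: T frames -> ceil(T / t).

    Average pooling is an assumed choice; the last window may be shorter than t.
    """
    pooled = []
    for start in range(0, len(frames), t):
        window = frames[start:start + t]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def project(frames, weight, bias):
    """Affine map applied per frame; weight has shape D_LLM x D."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def adapt(batch, weight, bias, t):
    """Hypothetical adapter: compress each clip in time, then project it."""
    return [project(compress_time(clip, t), weight, bias) for clip in batch]


# Toy sizes, chosen only for illustration.
B, T, D, D_LLM, t = 2, 10, 4, 6, 3
rng = random.Random(0)
music = [[[rng.random() for _ in range(D)] for _ in range(T)] for _ in range(B)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

out = adapt(music, weight, bias, t)
print(len(out), len(out[0]), len(out[0][0]))  # 2 4 6, since ceil(10 / 3) = 4
```

In this sketch only `weight` and `bias` would be trainable; the encoder producing `music` and the LLM consuming `out` are treated as fixed, matching the frozen-backbone description.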

Abstract

Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Because high-quality music Q&A datasets are scarce, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate competitive performance in generating music captions and composing music-related Q&A pairs. The dataset we introduce enables notable improvements over previous ones.

Paper Structure (23 sections, 1 figure, 5 tables)


Introduction
Related Work
Dataset & Evaluation Metrics
Large Dataset for Pre-training
Music Instruction Following Dataset
Collection Process
Quality Evaluation
Evaluation Metrics
Method
Model Architecture
Music-Text pre-training
Music Instruction Tuning
Experiment and Results
Experiment Setup
Result Analysis on Question-Answering
...and 8 more sections

Figures (1)

Figure 1: Overview of the MusiLingo model. Note that the backbone LLM, Vicuna-7B, can easily be swapped for another LLM.
