CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Shangda Wu; Yashan Wang; Ruibin Yuan; Zhancheng Guo; Xu Tan; Ge Zhang; Monan Zhou; Jing Chen; Xuefeng Mu; Yuejie Gao; Yuanliang Dong; Jiafeng Liu; Xiaobing Li; Feng Yu; Maosong Sun

TL;DR

CLaMP 2 tackles multilingual cross-modal music information retrieval across 101 languages by aligning a multilingual text representation with a dual-branch music encoder for ABC notation and MIDI. It combines GPT-4-refined metadata with an XLM-R-based text encoder and an extended M3-based music encoder, trained on 1.5 million ABC-MIDI-text triplets, to achieve robust cross-lingual retrieval and classification. The approach delivers state-of-the-art performance on multilingual semantic search and cross-modal music tasks while mitigating textual noise and language imbalance. The work aims to set a new standard for global MIR and points toward future integration with audio and visual modalities for richer cross-cultural music experiences.

Abstract

Current music information retrieval systems struggle to manage linguistic diversity and to integrate multiple musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
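To make "aligned via contrastive learning" concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective over paired text and music embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, so the function names, the cosine-similarity scoring, and the temperature value of 0.07 are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(text_embs, music_embs, temperature):
    """Pairwise cosine similarities between every text and music embedding,
    divided by the temperature. Row i, column j scores text i against music j."""
    texts = [l2_normalize(v) for v in text_embs]
    musics = [l2_normalize(v) for v in music_embs]
    return [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in musics]
        for t in texts
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i,
    i.e. the i-th text is assumed to be paired with the i-th music item."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, music_embs, temperature=0.07):
    """Average of the text-to-music and music-to-text losses.
    The temperature default is an assumed value, not taken from the paper."""
    logits = similarity_logits(text_embs, music_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))
```

Under this sketch, a batch whose text and music embeddings point in matching directions yields a loss near zero, while a batch with shuffled pairings yields a large loss, which is the pressure that pulls matched text and music representations together in a shared space.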
