CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, Monan Zhou, Jing Chen, Xuefeng Mu, Yuejie Gao, Yuanliang Dong, Jiafeng Liu, Xiaobing Li, Feng Yu, Maosong Sun

TL;DR

CLaMP 2 tackles multilingual cross-modal music information retrieval (MIR) across 101 languages by aligning a multilingual text representation with a dual-branch music encoder for ABC notation and MIDI. It pairs GPT-4-refined metadata with an XLM-R-based text encoder and an extended M3-based music encoder, and is trained on 1.5 million ABC-MIDI-text triplets for cross-lingual retrieval and classification. The approach delivers state-of-the-art performance on multilingual semantic search and cross-modal music tasks while mitigating textual noise and language imbalance. The work sets a new standard for global MIR and points toward future integration with audio and visual modalities for richer cross-cultural music experiences.

Abstract

Current music information retrieval systems face challenges in managing linguistic diversity and integrating various musical modalities. These limitations reduce their effectiveness in a global, multimodal music environment. To address these issues, we introduce CLaMP 2, a system compatible with 101 languages that supports both ABC notation (a text-based musical notation format) and MIDI (Musical Instrument Digital Interface) for music information retrieval. CLaMP 2, pre-trained on 1.5 million ABC-MIDI-text triplets, includes a multilingual text encoder and a multimodal music encoder aligned via contrastive learning. By leveraging large language models, we obtain refined and consistent multilingual descriptions at scale, significantly reducing textual noise and balancing language distribution. Our experiments show that CLaMP 2 achieves state-of-the-art results in both multilingual semantic search and music classification across modalities, thus establishing a new standard for inclusive and global music information retrieval.
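
To make the phrase "aligned via contrastive learning" concrete, the sketch below shows one common formulation, the symmetric InfoNCE objective used by CLIP-style models. The abstract does not state the exact loss, so this is an illustration under assumptions rather than the authors' implementation: the function names, the temperature of 0.07, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(text_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (text, music) embeddings.

    Pair i of text_emb and music_emb is the positive; every other
    combination in the batch serves as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    text = [l2_normalize(v) for v in text_emb]
    music = [l2_normalize(v) for v in music_emb]
    n = len(text)

    # logits[i][j]: temperature-scaled cosine similarity of text i and music j
    logits = [
        [sum(a * b for a, b in zip(t, m)) / temperature for m in music]
        for t in text
    ]

    # Text-to-music direction: each text should pick out its own music item.
    t2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Music-to-text direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    m2t = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return (t2m + m2t) / 2


# Toy batch: embeddings that agree pairwise versus embeddings that are swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing a loss of this kind pulls each text embedding toward its paired music embedding and pushes it away from the other items in the batch, which is what lets a text query retrieve music from the shared space.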

Paper Structure

This paper contains 22 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: CLaMP 2 is a cross-modal MIR model that uses contrastive learning to link multilingual text and multimodal music data. It employs GPT-4 to refine the multilingual corpus, reducing noise and achieving a more balanced language distribution. The refined text data is then encoded by a multilingual text encoder. Meanwhile, music data in both ABC notation (sheet music) and MIDI (performance data) formats is processed by a multimodal music encoder. Both encoders project data into a shared representation space to connect text and music.
  • Figure 2: The distribution of counts for different text types within the LLM-processed pre-training dataset.
  • Figure 3: The amount of data for 97 languages found in the original metadata, displayed in order of magnitude.
  • Figure 4: Count of text entries for 100 non-English languages generated by GPT-4.
  • Figure 5: MRR scores across six non-English languages for (a) WikiMT and (b) MidiCaps benchmarks. BLEU scores below each language provide additional context on translation quality.
  • ...and 5 more figures
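
Figure 5 reports MRR (mean reciprocal rank), which averages, over all queries, the reciprocal of the rank at which the correct item is retrieved. The following minimal sketch shows how such a score could be computed from similarity scores; the helper names and the toy scores are hypothetical and are not drawn from the paper's evaluation code.

```python
def rank_of_target(scores, target_index):
    """1-based rank of the target item, where a higher score means a better match."""
    target = scores[target_index]
    return 1 + sum(1 for s in scores if s > target)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; 1.0 means every target was ranked first."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example: three text queries, each scored against four candidate pieces.
# Each tuple holds (similarity scores, index of the correct piece).
queries = [
    ([0.9, 0.1, 0.3, 0.2], 0),  # correct piece ranked 1st
    ([0.4, 0.8, 0.6, 0.1], 2),  # correct piece ranked 2nd
    ([0.7, 0.6, 0.5, 0.2], 3),  # correct piece ranked 4th
]
ranks = [rank_of_target(scores, idx) for scores, idx in queries]
mrr = mean_reciprocal_rank(ranks)
print(ranks, round(mrr, 4))
```

Because the reciprocal shrinks quickly, MRR rewards placing the correct item at or near the top far more than it penalizes differences among low ranks.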