CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining

Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, Simon Lui

TL;DR

CrossMuSim tackles the problem of music similarity retrieval by leveraging cross-modal learning between text and audio. It introduces a dual-source data pipeline (online scraping and LLM-based prompting) to generate rich text descriptions for tracks, paired with a cross-modal contrastive framework that aligns text and audio in a shared embedding space using an NT-Xent-style objective. The approach uses a multilingual text encoder (distiluse-base-multilingual-cased-v2) and an audio encoder (Music Tagging Transformer), with a two-layer projection to unify representations. Empirical results across objective metrics, subjective MOS, and real-world Huawei Music A/B tests demonstrate that combining text-only aspects and captions, augmented by LLM prompting, yields significant improvements over baselines, validating the practicality of text-guided music similarity modeling for streaming platforms. The work highlights the potential of scalable semantic signals from language models to enhance music recommendation systems and cross-modal retrieval. The cross-modal objective is trained with $\mathcal{L}_{\text{NT-Xent}} = -\sum_{i=1}^{N} \log \frac{\exp(z_{i,i}/\tau)}{\sum_{j=1}^{N} \exp(z_{i,j}/\tau)}$, where $z_{i,j}$ denotes the cosine similarity between the $i$-th and $j$-th items of a batch of $N$ text-music pairs and $\tau$ is the temperature parameter.
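The NT-Xent-style objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature value `tau=0.1`, and the choice of text embeddings as rows and audio embeddings as columns of the similarity matrix are assumptions made for illustration. The loss is written in the usual negated form, so that minimizing it raises the similarity of matched text-music pairs relative to mismatched ones.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def nt_xent_loss(text_emb, audio_emb, tau=0.1):
    """NT-Xent-style loss over a batch of N paired embeddings.

    text_emb, audio_emb: arrays of shape (N, D); row i of each is a matched pair.
    tau: temperature (0.1 is an illustrative default, not a value from the paper).
    """
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    a = l2_normalize(np.asarray(audio_emb, dtype=float))

    # z[i, j] = cosine similarity between text i and audio j.
    z = t @ a.T
    logits = z / tau

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Sum the negative log-probabilities of the matched (diagonal) pairs.
    return float(-np.trace(log_softmax))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))
    audio = text + 0.05 * rng.normal(size=(4, 8))  # roughly aligned pairs
    print("aligned loss:  ", nt_xent_loss(text, audio))
    print("shuffled loss: ", nt_xent_loss(text, audio[::-1]))
```

In this sketch the aligned batch should produce a much smaller loss than the shuffled one, which is the behavior the contrastive objective rewards. A symmetric variant that also averages the audio-to-text direction is common in practice, but the formula quoted here only specifies one direction.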

Abstract

Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs' comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.

Paper Structure

This paper contains 15 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of the proposed CrossMuSim framework. (a) Dual-source data acquisition module combining online scraping and LLM-based prompting for textual description generation, (b) Music similarity modeling utilizing cross-modal contrastive learning framework with text-music pairs, and (c) Inference phase for music similarity retrieval using audio modality.
  • Figure 2: Music-to-music performance of MTT baseline, online scraping, and online scraping with LLM-prompting methods in terms of (a) coarse-grained and (b) fine-grained music categorization levels.