Bridging the Gap Between Semantic and User Preference Spaces for Multi-modal Music Representation Learning
Xiaofeng Pan, Jing Chen, Haitong Zhang, Menglin Xing, Jiayi Wei, Xuefeng Mu, Zhongqian Xie
TL;DR
This work tackles the gap between semantic representations (audio-text) and user preference signals in music by introducing Hierarchical Two-stage Contrastive Learning (HTCL). It first learns robust audio-text semantics with a scalable audio encoder and a pre-trained BERT text encoder via large-scale contrastive pre-training, then adapts these semantics to user preferences through contrastive fine-tuning on interaction triplets, while preserving semantic integrity. The approach is validated on real-world platform data, showing improvements in both music semantic tasks (genre/language classification) and downstream recommendation metrics, with ablations confirming the importance of text modality and user-guided fine-tuning. The results suggest HTCL offers a practical path to unified multi-modal music representations that support both understanding and personalized recommendation in large-scale systems, and the authors provide public code and datasets to aid reproducibility.
Abstract
Recent work on music representation learning mainly focuses on learning acoustic music representations from unlabeled audio, or further attempts to acquire multi-modal music representations from scarce annotated audio-text pairs. These approaches either ignore language semantics or rely on labeled audio datasets that are difficult and expensive to create. Moreover, modeling the semantic space alone usually fails to achieve satisfactory performance on music recommendation tasks, since the user preference space is ignored. In this paper, we propose a novel Hierarchical Two-stage Contrastive Learning (HTCL) method that models similarity hierarchically, from the semantic perspective to the user perspective, to learn a comprehensive music representation that bridges the gap between the semantic and user preference spaces. We devise a scalable audio encoder and leverage a pre-trained BERT model as the text encoder to learn audio-text semantics via large-scale contrastive pre-training. Further, we explore a simple yet effective way to exploit interaction data from our online music platform to adapt the semantic space to the user preference space via contrastive fine-tuning, which differs from previous works that follow the idea of collaborative filtering. As a result, we obtain a powerful audio encoder that not only distills language semantics from the text encoder but also models similarity in the user preference space while preserving the integrity of the semantic space. Experimental results on both music semantic and recommendation tasks confirm the effectiveness of our method.
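The two contrastive stages described above can be illustrated with a minimal sketch. The abstract does not give the exact loss functions, so this assumes a CLIP-style symmetric InfoNCE objective for the audio-text pre-training stage and an InfoNCE objective over (anchor, positive, negatives) track embeddings for the user-preference fine-tuning stage. The function names, the temperature value of 0.07, and the triplet layout are all hypothetical choices made for illustration, and plain Python lists stand in for encoder outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where queries[i] should match keys[i] within the batch."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        total += log_sum_exp(logits) - logits[i]
    return total / len(q)

def audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """Stage 1 (assumed form): symmetric audio->text and text->audio InfoNCE."""
    return 0.5 * (info_nce(audio_emb, text_emb, temperature)
                  + info_nce(text_emb, audio_emb, temperature))

def preference_loss(anchor, positive, negatives, temperature=0.07):
    """Stage 2 (assumed form): pull the anchor track toward a track the same
    users interacted with, and push it away from the sampled negatives."""
    a = l2_normalize(anchor)
    candidates = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    return log_sum_exp(logits) - logits[0]

# Toy embeddings: matched audio-text pairs should give a lower loss than swapped ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(audio_text_loss(audio, text_matched) < audio_text_loss(audio, text_swapped))
```

In this sketch both stages share the same InfoNCE form and differ only in what counts as a positive pair: a track's own text in the first stage, and a track linked through user interactions in the second. How the actual method preserves the semantic space during fine-tuning is not specified in the abstract and is not modeled here.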
