Towards Leveraging Contrastively Pretrained Neural Audio Embeddings for Recommender Tasks
Florian Grötschla, Luca Strässle, Luca A. Lanzendörfer, Roger Wattenhofer
TL;DR
This work tackles the cold-start problem in music recommendation by evaluating contrastively pretrained neural audio embeddings, notably CLAP, within a graph-based artist-relationship framework. The study updates the OLGA-style dataset and compares CLAP against AcousticBrainz and mood/theme features across varying numbers of GNN layers. CLAP embeddings outperform the traditional features, and the gap widens as more GNN layers are used, that is, as the model aggregates over larger graph neighborhoods. Combining features offers gains when few layers are used. The results suggest that contrastively pretrained audio-text representations can enhance content-informed recommendation and can inform future hybrid graph-based systems for music discovery and artist similarity tasks.
Abstract
Music recommender systems often use network-based models to capture relationships between music pieces, artists, and users. Although these relationships are valuable for prediction, new music pieces and artists face the cold-start problem because little information about them is initially available. To address this, one can extract content-based information directly from the music and use it to enhance collaborative-filtering-based methods. Previous approaches have relied on hand-crafted audio features for this purpose; we instead explore contrastively pretrained neural audio embedding models, which offer a richer and more nuanced representation of music. Our experiments demonstrate that neural embeddings, particularly those generated with the Contrastive Language-Audio Pretraining (CLAP) model, are a promising way to enhance music recommendation tasks within graph-based frameworks.
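To make the role of content features in a graph-based framework more concrete, the following minimal sketch shows how per-artist feature vectors spread through an artist graph as layers are stacked. It is an illustration under stated assumptions, not the model used in this work: the function `mean_aggregate`, the four-artist path graph, and the one-hot features (standing in for audio embeddings such as CLAP) are all hypothetical, and the aggregation is a plain neighborhood mean without learned weights.

```python
import numpy as np

def mean_aggregate(adj: np.ndarray, feats: np.ndarray, num_layers: int) -> np.ndarray:
    """Average each node's features with its neighbors', num_layers times."""
    # Add self-loops so a node keeps its own content features.
    a_hat = adj + np.eye(adj.shape[0])
    # Row-normalize so each step is a mean over the node and its neighbors.
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)
    out = feats
    for _ in range(num_layers):
        out = a_hat @ out
    return out

# Hypothetical path graph of four artists: 0 - 1 - 2 - 3.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# One-hot placeholder features (assumption): column j is nonzero for node i
# only if artist j's features have reached artist i.
feats = np.eye(4)

for k in range(4):
    h = mean_aggregate(adj, feats, k)
    reach = np.flatnonzero(h[0] > 0).tolist()
    print(f"{k} layers: artist 0 sees artists {reach}")
```

In this toy setting each additional layer extends the neighborhood that contributes to an artist's representation by one hop, which is the sense in which deeper graph models draw on larger neighborhoods of content features.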
