Lyrics Matter: Exploiting the Power of Learnt Representations for Music Popularity Prediction

Yash Choudhary, Preeti Rao, Pushpak Bhattacharyya

TL;DR

The paper investigates song popularity prediction by moving lyrics from hand-crafted features to representations learned by large language models. It introduces HitMusicLyricNet, a multimodal architecture that fuses audio, LLM-derived lyric embeddings, and metadata: separate autoencoders compress the audio and lyric inputs, and a fusion network predicts popularity from the result. On the SpotGenTrack dataset the authors report improvements of approximately 9% in MAE and 20% in MSE over the baseline, and they provide extensive error and interpretability analyses to understand each modality's contribution. They also emphasize the need for domain-aware lyric representations and propose directions for future work, including micro-segment analysis and improved lyric modeling. Overall, the work underscores the predictive value of learned lyric representations in music popularity modeling and offers a scalable pipeline for industry use.
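The data flow described above (two modality encoders feeding a fusion network) can be sketched in a few lines. This is only an illustration of the wiring, not the authors' implementation: the layer sizes, the random untrained weights, the sigmoid output scaled to 0-100, and every name (`init_layer`, `predict_popularity`, `AUDIO_DIM`, and so on) are assumptions made for this example, and the decoder halves of the autoencoders are omitted.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights for illustration only (the network is untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


# Hypothetical dimensions; the paper's actual sizes are not given in this summary.
AUDIO_DIM, LYRIC_DIM, META_DIM = 16, 32, 4
AUDIO_CODE, LYRIC_CODE, HIDDEN = 4, 8, 8

rng = random.Random(0)
audio_encoder = init_layer(AUDIO_DIM, AUDIO_CODE, rng)   # encoder half of the audio autoencoder
lyric_encoder = init_layer(LYRIC_DIM, LYRIC_CODE, rng)   # encoder half of the lyrics autoencoder
fusion_hidden = init_layer(AUDIO_CODE + LYRIC_CODE + META_DIM, HIDDEN, rng)
fusion_head = init_layer(HIDDEN, 1, rng)


def predict_popularity(audio, lyrics, meta):
    """Compress each modality, concatenate with metadata, and regress a 0-100 score."""
    audio_code = relu(linear(audio, *audio_encoder))
    lyric_code = relu(linear(lyrics, *lyric_encoder))
    fused = audio_code + lyric_code + list(meta)  # list concatenation
    hidden = relu(linear(fused, *fusion_hidden))
    return 100.0 * sigmoid(linear(hidden, *fusion_head)[0])


# Toy inputs standing in for audio features, an LLM lyric embedding, and metadata.
audio = [rng.random() for _ in range(AUDIO_DIM)]
lyrics = [rng.random() for _ in range(LYRIC_DIM)]
meta = [rng.random() for _ in range(META_DIM)]
score = predict_popularity(audio, lyrics, meta)
```

Because the weights are random, `score` carries no meaning here; the point is only that each modality is reduced to a compact code before the fusion network sees it.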

Abstract

Accurately predicting music popularity is a critical challenge in the music industry, offering benefits to artists, producers, and streaming platforms. Prior research has largely focused on audio features, social metadata, or model architectures. This work addresses the under-explored role of lyrics in predicting popularity. We present an automated pipeline that uses an LLM to extract high-dimensional lyric embeddings, capturing semantic, syntactic, and sequential information. These features are integrated into HitMusicLyricNet, a multimodal architecture that combines audio, lyrics, and social metadata to predict a popularity score in the range 0-100. Our method outperforms existing baselines on the SpotGenTrack dataset, which contains over 100,000 tracks, achieving 9% and 20% improvements in MAE and MSE, respectively. Ablation confirms that the gains arise from our LLM-driven lyrics feature pipeline (LyricsAENet), underscoring the value of dense lyric representations.
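For readers less familiar with the two metrics, the sketch below shows how MAE, MSE, and a relative improvement over a baseline are typically computed. It assumes that "improvement" means the relative reduction in error with respect to the baseline, which is the usual reading but is not spelled out here. The toy numbers and the function names are made up for illustration and are unrelated to the paper's results.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, model_error):
    """Percentage by which the model's error is lower than the baseline's (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy popularity scores on the 0-100 scale (illustrative only).
actual = [40.0, 55.0, 20.0, 70.0]
baseline_pred = [50.0, 45.0, 30.0, 60.0]
model_pred = [45.0, 50.0, 28.0, 62.0]

mae_gain = relative_improvement(mae(actual, baseline_pred), mae(actual, model_pred))
mse_gain = relative_improvement(mse(actual, baseline_pred), mse(actual, model_pred))
```

Since MSE squares each error, it penalizes large misses more heavily than MAE does, so the same model can show different relative gains on the two metrics.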

Paper Structure

This paper contains 19 sections, 1 equation, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Diagram of the HitMusicNet pipeline outlining the principal functionalities and data components. Image source: MartínGutiérrez2020AME.
  • Figure 2: Popularity distribution in the cleaned SpotGenTrack (SPD) dataset, with mean $\mu = 41.11$ and standard deviation $\sigma = 17.51$.
  • Figure 3: Block schematic of the HitMusicLyricNet architecture, comprising two autoencoders and a fully connected neural network that predicts the popularity score. 'HL' stands for high-level and 'LL' stands for low-level.
  • Figure 4: Actual (blue) vs. predicted (red) music popularity distributions on the test set, showing prediction compression at both tails with aligned means ($\mu_{\text{actual}}=0.422$, $\mu_{\text{predicted}}=0.428$).
  • Figure 5: Model calibration plot showing alignment between mean predicted and actual popularity per bin.
  • ...and 10 more figures
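
A calibration plot of the kind described for Figure 5 compares the mean predicted and mean actual popularity within each bin. The sketch below shows one common way to build those per-bin means. It assumes equal-width bins over predictions normalized to [0, 1]; the paper's actual binning scheme is not stated in this summary, and `calibration_bins` and the toy data are hypothetical.

```python
def calibration_bins(actual, predicted, n_bins=5, lo=0.0, hi=1.0):
    """Group samples by predicted value and return per-bin means.

    Returns a list of (bin_index, mean_predicted, mean_actual, count),
    skipping empty bins.
    """
    width = (hi - lo) / n_bins
    totals = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for a, p in zip(actual, predicted):
        # Clamp so that p == hi (or slightly out-of-range values) lands in a valid bin.
        idx = max(0, min(int((p - lo) / width), n_bins - 1))
        totals[idx][0] += p
        totals[idx][1] += a
        totals[idx][2] += 1
    return [
        (idx, sum_p / n, sum_a / n, n)
        for idx, (sum_p, sum_a, n) in enumerate(totals)
        if n > 0
    ]


# Toy normalized popularity values (illustrative only).
predicted = [0.10, 0.15, 0.50, 0.55, 0.90]
actual = [0.20, 0.10, 0.40, 0.60, 0.80]
rows = calibration_bins(actual, predicted, n_bins=5)
```

A well-calibrated model yields rows whose mean predicted and mean actual values are close in every bin, which on the plot appears as points lying near the diagonal.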