Joint sentiment analysis of lyrics and audio in music
Lea Schaab, Anna Kruspe
TL;DR
The paper investigates automated sentiment analysis in music by jointly analyzing audio and lyrics, comparing unimodal results, and evaluating fusion strategies. Using valence-arousal (VA) and MIREX-like datasets whose annotations are mapped to the four quadrants $Q1$–$Q4$ of Russell's circumplex model via the ANEW lexicon, it benchmarks audio models (USC SAIL Short-Chunk CNN) and lyric models (Lyrics Model, SiEBERT, Poem-based, and 6-Emotion) and tests three fusion schemes; a 60% audio / 40% text weighting delivers the best overall performance. Results show that text-based lyric models often excel at valence detection and for certain polarities, but that combining modalities yields higher accuracy, particularly for positive emotions. Misclassifications reveal inconsistencies in annotations and emotion taxonomies. The study highlights challenges such as subjectivity and data scarcity, and argues for high-quality bimodal datasets and new multimodal models that better capture the interplay between lyrics and audio in expressing emotion.
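As an illustration of the quadrant mapping mentioned above, the following is a minimal sketch of how a valence-arousal pair could be assigned to one of the four quadrants. It assumes the common convention that $Q1$ is positive valence with high arousal, $Q2$ negative valence with high arousal, $Q3$ negative valence with low arousal, and $Q4$ positive valence with low arousal. The function name `va_to_quadrant`, the `midpoint` parameter, and the treatment of values exactly at the midpoint are hypothetical choices, not taken from the paper.

```python
def va_to_quadrant(valence: float, arousal: float, midpoint: float = 0.0) -> str:
    """Map a valence-arousal pair to a quadrant label "Q1".."Q4".

    Assumption: values at or above `midpoint` count as positive/high.
    For a 1-9 rating scale such as ANEW's, midpoint=5.0 would be a
    plausible setting; the paper's actual thresholds are not given here.
    """
    positive_valence = valence >= midpoint
    high_arousal = arousal >= midpoint

    if positive_valence and high_arousal:
        return "Q1"  # e.g. happy, excited
    if not positive_valence and high_arousal:
        return "Q2"  # e.g. angry, tense
    if not positive_valence and not high_arousal:
        return "Q3"  # e.g. sad, depressed
    return "Q4"      # e.g. calm, relaxed


if __name__ == "__main__":
    # Ratings on a hypothetical 1-9 scale with midpoint 5.
    print(va_to_quadrant(7.2, 3.1, midpoint=5.0))
```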
Abstract
Sentiment or mood can be expressed at various levels in music. In automatic analysis, the actual audio data is usually analyzed, but the lyrics can also play a crucial role in the perception of moods. We first evaluate various models for sentiment analysis based on lyrics and audio separately. The corresponding approaches already show satisfactory results, but they also exhibit weaknesses, the causes of which we examine in more detail. Furthermore, different approaches to combining the audio and lyrics results are proposed and evaluated. Considering both modalities generally leads to improved performance. We investigate misclassifications and contradictions between audio and lyrics sentiment, including intentional ones, more closely and identify possible causes. Finally, we address fundamental problems in this research area, such as high subjectivity, lack of data, and inconsistency in emotion taxonomies.
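To make the idea of combining audio and lyrics results concrete, here is a minimal sketch of a weighted late fusion using the 60% audio / 40% text weighting reported in the TL;DR. It assumes that each unimodal model outputs one probability per quadrant, which is an assumption made for illustration; the paper's fusion schemes may operate on different outputs. The names `fuse_predictions`, `QUADRANTS`, and `audio_weight` are hypothetical.

```python
from typing import Dict, Tuple

QUADRANTS = ("Q1", "Q2", "Q3", "Q4")


def fuse_predictions(
    audio_probs: Dict[str, float],
    lyrics_probs: Dict[str, float],
    audio_weight: float = 0.6,
) -> Tuple[str, Dict[str, float]]:
    """Combine per-quadrant scores from two models by a weighted average.

    Assumption: both inputs hold one score per quadrant. The lyrics
    weight is 1 - audio_weight, so the default gives a 60/40 split.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    lyrics_weight = 1.0 - audio_weight

    fused = {
        q: audio_weight * audio_probs[q] + lyrics_weight * lyrics_probs[q]
        for q in QUADRANTS
    }
    # Pick the quadrant with the highest fused score
    # (ties resolve in the order Q1, Q2, Q3, Q4).
    label = max(QUADRANTS, key=lambda q: fused[q])
    return label, fused


if __name__ == "__main__":
    # Made-up scores: the audio model leans towards Q1, the lyrics model towards Q4.
    audio = {"Q1": 0.5, "Q2": 0.3, "Q3": 0.1, "Q4": 0.1}
    lyrics = {"Q1": 0.1, "Q2": 0.1, "Q3": 0.2, "Q4": 0.6}
    print(fuse_predictions(audio, lyrics))
```

In this made-up example the two modalities disagree, and the heavier audio weight decides the outcome; shifting `audio_weight` towards 0 lets the lyrics prediction dominate instead.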
