Joint sentiment analysis of lyrics and audio in music

Lea Schaab; Anna Kruspe

TL;DR

The paper investigates automated sentiment analysis in music by analyzing audio and lyrics jointly, comparing unimodal results, and evaluating fusion strategies. It uses VA and MIREX-like data sets, annotated and mapped to the four quadrants $Q1$–$Q4$ of Russell's circumplex model via the ANEW lexicon. On these data sets it benchmarks audio models (USC SAIL Short-Chunk CNN) and lyric models (Lyrics Model, SiEBERT, Poem-based, and 6-Emotion) and tests three fusion schemes, of which a 60% audio / 40% text weighting delivers the best overall performance. The results show that text-based lyric models often excel at valence detection and for certain polarities, but that combining the modalities yields higher accuracy, particularly for positive emotions. Misclassifications reveal inconsistencies in annotations and emotion taxonomies. The study highlights challenges such as subjectivity and data scarcity, and argues for high-quality bimodal data sets and new multimodal models that better capture the interplay between lyrics and audio in expressing emotion.
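To make the quadrant mapping and the weighted fusion concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes valence and arousal are centered at 0, uses the common convention Q1 = positive valence / high arousal, Q2 = negative valence / high arousal, Q3 = negative valence / low arousal, Q4 = positive valence / low arousal, and assumes each model outputs a probability vector over the four quadrants. The function names `va_to_quadrant` and `fuse_quadrant_probs` are hypothetical.

```python
from typing import List, Sequence, Tuple

QUADRANTS = ("Q1", "Q2", "Q3", "Q4")


def va_to_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to a quadrant label.

    Assumption: both values are centered at 0, so the sign decides the quadrant.
    """
    if arousal >= 0:
        return "Q1" if valence >= 0 else "Q2"
    return "Q4" if valence >= 0 else "Q3"


def fuse_quadrant_probs(
    audio_probs: Sequence[float],
    text_probs: Sequence[float],
    audio_weight: float = 0.6,
) -> Tuple[str, List[float]]:
    """Late fusion by weighted averaging of per-quadrant probabilities.

    The default 0.6 / 0.4 split mirrors the weighting reported as best overall;
    averaging probability vectors is an assumption about how it is applied.
    """
    if len(audio_probs) != len(QUADRANTS) or len(text_probs) != len(QUADRANTS):
        raise ValueError("expected one probability per quadrant")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    text_weight = 1.0 - audio_weight
    fused = [
        audio_weight * a + text_weight * t
        for a, t in zip(audio_probs, text_probs)
    ]
    # Pick the quadrant with the highest fused score.
    best = max(range(len(fused)), key=fused.__getitem__)
    return QUADRANTS[best], fused


if __name__ == "__main__":
    audio = [0.5, 0.3, 0.1, 0.1]  # made-up audio model output
    text = [0.1, 0.1, 0.2, 0.6]   # made-up lyrics model output
    label, scores = fuse_quadrant_probs(audio, text)
    print(label, scores)
```

With these made-up inputs the audio model favors Q1 and the lyrics model favors Q4; the audio-heavy default weighting lets the audio prediction win, while a text-heavy weight such as `audio_weight=0.4` flips the decision to Q4.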

Abstract

Sentiment or mood can be expressed on various levels in music. In automatic analysis, the actual audio data is usually analyzed, but the lyrics can also play a crucial role in the perception of moods. We first evaluate various models for sentiment analysis based on lyrics and audio separately. The corresponding approaches already show satisfactory results, but they also exhibit weaknesses, the causes of which we examine in more detail. Furthermore, we propose and evaluate different approaches to combining the audio and lyrics results. Considering both modalities generally leads to improved performance. We investigate misclassifications and contradictions (including intentional ones) between audio and lyrics sentiment more closely, and identify possible causes. Finally, we address fundamental problems in this research area, such as high subjectivity, lack of data, and inconsistency in emotion taxonomies.

Paper Structure (3 sections, 4 figures, 1 table)

VA Data Set
MIREX-like Data Set
Data Preprocessing

Figures (4)

Figure 1: The 2D valence-arousal emotion space [yang2012machine].
Figure 2: Results of the audio-only model. Left: Quadrants of the circumplex model, right: Binary results for valence and arousal.
Figure 3: Results of the various text-only models.
Figure 4: Results of different model fusion strategies.
