Joint Learning of Emotions in Music and Generalized Sounds

Federico Simonetta; Francesca Certo; Stavros Ntalampiras

TL;DR

This work addresses cross-domain emotion recognition by investigating whether generalized sounds and music share a common emotional space along the arousal and valence axes. It builds a shared feature space using 6375 openSMILE ComParE features extracted from IADS-E and PMEmo, and evaluates ElasticNet, SVR, and AutoML within a dataset-augmentation framework controlled by $k$ and $p$ in the mixed dataset $k \times |\text{IADS-E}| + p \times |\text{PMEmo}|$. The results show that augmentation improves both arousal and valence predictions, with AutoML achieving state-of-the-art performance and arousal benefiting more from cross-domain transfer ($R^2 > 0.15$) than valence. The findings highlight the effectiveness of non-linear models in a shared affective space and suggest that the approach could be extended to additional data classes for diverse affective tasks, offering a simple yet powerful route to enhance AER/MER systems.
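The mixed training set $k \times |\text{IADS-E}| + p \times |\text{PMEmo}|$ can be illustrated with a minimal sketch. It assumes that $k$ and $p$ are fractions in $[0, 1]$ and uses plain uniform sub-sampling instead of the stratified sampling the paper describes. The function name `mix_datasets`, the toy array sizes, and the 5-column feature matrices are hypothetical placeholders, not the authors' code or the real 6375-dimensional features.

```python
import numpy as np


def mix_datasets(X_a, y_a, X_b, y_b, k, p, seed=0):
    """Stack a fraction k of dataset A with a fraction p of dataset B.

    Assumption: k and p are fractions in [0, 1]; rows are drawn
    uniformly at random without replacement (not stratified).
    """
    rng = np.random.default_rng(seed)
    n_a = int(round(k * len(X_a)))
    n_b = int(round(p * len(X_b)))
    idx_a = rng.choice(len(X_a), size=n_a, replace=False)
    idx_b = rng.choice(len(X_b), size=n_b, replace=False)
    X = np.vstack([X_a[idx_a], X_b[idx_b]])
    y = np.concatenate([y_a[idx_a], y_b[idx_b]])
    return X, y


# Toy stand-ins for the two datasets (sizes and features are made up).
toy_rng = np.random.default_rng(42)
X_sounds, y_sounds = toy_rng.normal(size=(100, 5)), toy_rng.uniform(1, 9, size=100)
X_music, y_music = toy_rng.normal(size=(80, 5)), toy_rng.uniform(1, 9, size=80)

# Half of the sound clips plus a quarter of the music clips.
X_mix, y_mix = mix_datasets(X_sounds, y_sounds, X_music, y_music, k=0.5, p=0.25)
```

Setting `k=0` or `p=0` recovers single-domain training, which makes it easy to compare a model trained on one dataset against the same model trained on the augmented set.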

Abstract

In this study, we aim to determine if generalized sounds and music can share a common emotional space, improving predictions of emotion in terms of arousal and valence. We propose the use of multiple datasets as a multi-domain learning technique. Our approach involves creating a common space encompassing features that characterize both generalized sounds and music, as they can evoke emotions in a similar manner. To achieve this, we utilized two publicly available datasets, namely IADS-E and PMEmo, following a standardized experimental protocol. We employed a wide variety of features that capture diverse aspects of the audio structure including key parameters of spectrum, energy, and voicing. Subsequently, we performed joint learning on the common feature space, leveraging heterogeneous model architectures. Interestingly, this synergistic scheme outperforms the state-of-the-art in both sound and music emotion prediction. The code enabling full replication of the presented experimental pipeline is available at https://github.com/LIMUNIMI/MusicSoundEmotions.

Paper Structure (7 sections, 1 equation, 5 figures, 2 tables)

This paper contains 7 sections, 1 equation, 5 figures, 2 tables.

Introduction
Methodology
Data sets
Feature Extraction
Model Selection and Validation on Combined Data Sets
Experimental Set-Up and Results
Conclusion

Figures (5)

Figure 1: Distribution of the ratings in both datasets on the valence-arousal plane.
Figure 2: The overall pipeline: first, clustering is used to enable stratified sampling; then, the datasets are sub-sampled according to the parameters $k$ and $p$; finally, the model is trained on the merged sub-populations and tested on the original test folds.
Figure 3: $R^2$ values for the various training sets. The test set for the left plot was IADS-E (without music samples); for the right plot it was PMEmo. Negative values are truncated at -1.
Figure 4: $R^2$ of the AutoML optimization when different augmentation ratios are used in the train set, i.e. for different values of $k$ and $p$ in the formula $k\times\textit{IADS-E} + p\times\textit{PMEmo}$. Each line represents a different test set; the IADS-E dataset was used without its music samples.
Figure 5: $R^2$ of the AutoML optimization when different augmentation ratios are used in the train set, i.e. for different values of $k$ and $p$ in the formula $k\times\textit{IADS-E} + p\times\textit{PMEmo}$. Both lines represent $R^2$ scores obtained on the PMEmo validation folds. The baseline is obtained by adding a randomized version of IADS-E to the train set in which the labels were synthesized by uniform random sampling.
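Two details from the captions above can be sketched in a few lines: the $R^2$ score with negative values truncated at -1 (Figure 3) and the randomized-label control (Figure 5). This is only an illustration under stated assumptions. The helper names are hypothetical, and drawing the random labels uniformly between the observed minimum and maximum is an assumption, since the caption does not state the sampling range.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def truncated_r2(y_true, y_pred, floor=-1.0):
    """R^2 with values below `floor` clipped, as done for plotting."""
    return max(r2_score(y_true, y_pred), floor)


def randomize_labels(y, seed=0):
    """Replace labels with uniform random values.

    Assumption: the range is the observed [min, max] of the real labels.
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(y.min(), y.max(), size=y.shape)


y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = truncated_r2(y_true, y_true)                 # exact predictions
mean_only = truncated_r2(y_true, np.full(4, 2.5))      # always predict the mean
very_bad = truncated_r2(y_true, np.full(4, 10.0))      # raw R^2 is far below -1
y_random = randomize_labels(y_true)                    # control labels
```

Training on the real target dataset plus a copy of the other dataset with `randomize_labels` applied gives a control of the kind Figure 5 describes: if the real cross-domain labels carry useful signal, the model trained with them should score above this baseline.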
