Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

Yujia Sun, Zeyu Zhao, Korin Richmond, Yuanchao Li

TL;DR

This paper investigates acoustic similarity between emotional speech and music by leveraging self-supervised representations and studying cross-domain transfer between SER and MER. The authors use layerwise probing of Wav2Vec2, HuBERT, and MERT (SSL models pretrained on speech or music), a two-stage domain adaptation framework with Baseline, Weighted-Sum, and PEFT strategies, and a Fréchet Audio Distance analysis to quantify cross-domain similarity for individual emotions. The results show that SSL models encode shared acoustic cues across speech and music, but that performance patterns depend on the emotion; speech SSL models are more amenable to cross-domain transfer than music SSL models, and parameter-efficient fine-tuning yields notable gains. These results offer a path toward improved SER and MER through cross-domain generalization and highlight emotion bias in SSL representations as a limitation that merits further study.
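
The Weighted-Sum strategy is only named in this summary, not specified. As a rough illustration, the sketch below shows one common way such a scheme is built: softmax-normalized weights over the per-layer hidden states of an SSL encoder, followed by mean pooling over time. The function names, the 13-layer and 768-dimension shapes, and the random stand-in features are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_layer_sum(hidden_states, layer_logits):
    """Combine per-layer features with softmax-normalized weights.

    hidden_states: array of shape (num_layers, time, dim)
    layer_logits:  array of shape (num_layers,); in a real model these
                   would be trainable parameters (assumption).
    Returns an array of shape (time, dim).
    """
    weights = softmax(layer_logits)
    # Contract the layer axis: sum_l weights[l] * hidden_states[l]
    return np.tensordot(weights, hidden_states, axes=(0, 0))

# Random stand-in for the hidden states of a 13-layer encoder (hypothetical shapes).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 50, 768))
layer_logits = np.zeros(13)  # uniform weighting before any training

fused = weighted_layer_sum(hidden_states, layer_logits)  # (time, dim)
utterance_embedding = fused.mean(axis=0)                 # (dim,), fed to a classifier
```

With all logits at zero, the weights are uniform and the result equals a plain average over layers; training would shift weight toward the layers that are most useful for the emotion task.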

Abstract

Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, as SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotional speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Fréchet Audio Distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary across emotions due to their training strategies and domain specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from the other domain. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.
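
For readers unfamiliar with the metric, the Fréchet distance between two sets of embeddings fits a Gaussian to each set and computes ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)). The sketch below is a minimal NumPy version under stated assumptions: the function names are hypothetical, the inputs are random stand-ins for per-emotion speech and music embeddings, and the paper's exact embedding extraction and per-layer setup are not reproduced here.

```python
import numpy as np

def gaussian_stats(embeddings):
    """Mean vector and covariance matrix of embeddings with shape (n, dim)."""
    mu = embeddings.mean(axis=0)
    sigma = np.cov(embeddings, rowvar=False)
    return mu, sigma

def frechet_distance(mu_a, sigma_a, mu_b, sigma_b):
    """Fréchet distance between two Gaussians N(mu_a, sigma_a) and N(mu_b, sigma_b)."""
    diff = mu_a - mu_b
    # Tr((S_a S_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of S_a S_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * trace_sqrt)

def fad(emb_a, emb_b):
    """Fréchet distance between two embedding sets, each of shape (n, dim)."""
    return frechet_distance(*gaussian_stats(emb_a), *gaussian_stats(emb_b))

# Random stand-ins for embeddings of one emotion class (hypothetical data).
rng = np.random.default_rng(0)
speech_emb = rng.normal(size=(500, 8))
music_emb = speech_emb + 3.0  # same spread, mean shifted by 3 in every dimension

same_domain = fad(speech_emb, speech_emb)   # close to 0
cross_domain = fad(speech_emb, music_emb)   # close to 8 * 3**2 = 72
```

A lower value means the two embedding distributions are closer. Computing it separately per emotion and per layer, as the abstract describes, would amount to calling `fad` once for each (emotion, layer) pair.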

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Layerwise SER and MER accuracy using SSL representations.
  • Figure 2: Layerwise SER (top) and MER (bottom) accuracy per emotion using representations from the SSL models.
  • Figure 3: Layerwise cross-domain FAD per emotion of the SSL models.