Exploring How Audio Effects Alter Emotion with Foundation Models

Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

TL;DR

The paper investigates how common audio effects reshape perceived emotion in music by leveraging foundation models pretrained on multimodal data. Using three architectures (MERT, CLAP, Qwen-Audio) and six FX across EMOPIA, DEAM, and witheFlow, the authors probe changes in performance, emotion predictions, and embedding trajectories with controlled FX manipulations and real-world sound chains. Key findings show distortion amplifies Anger and reduces Calmness, while other FX introduce variability in predictions; embedding-space analyses reveal model-dependent sensitivities, with real-world FX chains producing the most pronounced shifts. These insights inform music cognition and affective computing, highlighting how production choices translate into systematic changes in emotion representation within foundation models.

Abstract

Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models (large-scale neural architectures pretrained on multimodal data) can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from deep learning models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. With these findings, we aim to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.

Paper Structure

This paper contains 9 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Radar plots of emotion predictions for CLAP, Qwen, and MERT across three audio effects for the EMOPIA dataset. Each level in the plots depicts the distribution of emotions, normalised based on the greatest value in a given plot.
  • Figure 2: UMAP visualization of foundation model embeddings for the EMOPIA dataset, showing trajectories generated for each input identity after applying audio FX with the intensity ranging from 1 to 10.
  • Figure 3: UMAP visualization of foundation model embeddings for the witheFlow dataset, showing trajectories generated for each input after applying real-world scenario audio FX.
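
The captions above mention two simple operations: scaling the emotion distributions by the greatest value in a given plot (Figure 1), and following an input's embedding as FX intensity goes from 1 to 10 (Figure 2). The following is a minimal sketch of both, not the authors' code. The function names, the emotion counts, and the 2-D points are hypothetical placeholders, and the path-length summary is an illustrative choice that the paper summary does not mention.

```python
import numpy as np


def normalise_by_plot_max(values):
    """Divide every entry by the largest entry in the same plot.

    `values` holds all emotion scores drawn in one radar plot
    (for example, rows = FX intensity levels, columns = emotions).
    """
    arr = np.asarray(values, dtype=float)
    peak = arr.max()
    if peak <= 0:
        # Nothing to scale against; return zeros of the same shape.
        return np.zeros_like(arr)
    return arr / peak


def trajectory_length(embeddings):
    """Sum of Euclidean step lengths along one input's trajectory.

    `embeddings` has shape (n_intensity_levels, dim), ordered by
    increasing FX intensity.
    """
    points = np.asarray(embeddings, dtype=float)
    steps = np.diff(points, axis=0)          # consecutive displacements
    return float(np.linalg.norm(steps, axis=1).sum())


if __name__ == "__main__":
    # Made-up counts for two intensity levels and two emotions.
    toy_counts = [[2, 4], [1, 8]]
    print(normalise_by_plot_max(toy_counts))

    # Made-up 2-D points standing in for projected embeddings.
    toy_path = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
    print(trajectory_length(toy_path))
```

Normalising per plot keeps the shapes within one radar plot comparable, but values from different plots are then no longer on a shared scale.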