Exploring How Audio Effects Alter Emotion with Foundation Models
Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou
TL;DR
The paper investigates how common audio effects reshape perceived emotion in music by leveraging foundation models pretrained on multimodal data. Using three architectures (MERT, CLAP, Qwen-Audio) and six FX across EMOPIA, DEAM, and witheFlow, the authors probe changes in performance, emotion predictions, and embedding trajectories with controlled FX manipulations and real-world sound chains. Key findings show distortion amplifies Anger and reduces Calmness, while other FX introduce variability in predictions; embedding-space analyses reveal model-dependent sensitivities, with real-world FX chains producing the most pronounced shifts. These insights inform music cognition and affective computing, highlighting how production choices translate into systematic changes in emotion representation within foundation models.
Abstract
Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models, large-scale neural architectures pretrained on multimodal data, can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from these models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. This work aims to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.
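As a rough illustration of the controlled-manipulation idea described above, the sketch below applies one effect (distortion) at increasing strength and tracks how far a feature vector drifts from its clean baseline. It is not the authors' pipeline: the tanh soft-clipper, the `drive` parameter, and `toy_embedding` (a hand-made stand-in for a MERT, CLAP, or Qwen-Audio embedding) are all assumptions chosen so the example runs with the standard library alone.

```python
import math


def sine_wave(freq_hz, duration_s, sample_rate=16000):
    """Generate a mono test tone as a list of floats in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def distort(signal, drive):
    """Assumed distortion model: tanh soft clipping, rescaled so a
    full-scale input sample still maps to full scale. drive <= 0 is a bypass."""
    if drive <= 0:
        return list(signal)
    norm = math.tanh(drive)
    return [math.tanh(drive * x) / norm for x in signal]


def toy_embedding(signal):
    """Hypothetical stand-in for a foundation-model embedding.
    Returns [RMS level, peak level, mean absolute level]."""
    n = len(signal)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    mean_abs = sum(abs(x) for x in signal) / n
    return [rms, peak, mean_abs]


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embedding_trajectory(signal, drives):
    """Distance from the clean embedding at each effect strength."""
    base = toy_embedding(signal)
    return [cosine_distance(base, toy_embedding(distort(signal, d))) for d in drives]


if __name__ == "__main__":
    tone = sine_wave(440.0, 0.1)
    for drive, dist in zip([0.0, 1.0, 10.0], embedding_trajectory(tone, [0.0, 1.0, 10.0])):
        print(f"drive={drive:5.1f}  distance from clean embedding={dist:.6f}")
```

In a real study the stand-in function would be replaced by a pretrained model's embedding extractor, and the distance curve could be complemented by a probe that maps embeddings to emotion labels.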
