Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition
Alexandra Saliba, Yuanchao Li, Ramon Sanabria, Catherine Lai
TL;DR
This work investigates Acoustic Word Embeddings (AWEs) derived from self-supervised speech models, focusing on their layer-wise relationship to lexical representations and their effectiveness for Speech Emotion Recognition (SER). By computing layer-wise similarity between HuBERT-based AWEs and BERT embeddings, the authors show that AWEs encode a distinct acoustic context with limited lexical alignment, which peaks around layer 9. They run SER experiments on IEMOCAP and ESD using AWEs, raw HuBERT representations, and Mel features, and explore both concatenation and cross-attention fusion with BERT embeddings. On both datasets, AWEs are competitive with or superior to the other features in some settings, particularly when the fusion with BERT embeddings is chosen carefully. Layer-wise analyses show that AWEs can exploit early layers for acoustic-context advantages while remaining robust in later layers. The findings offer practical guidance for using AWEs in SER and for combining self-supervised speech representations with lexical information.
Abstract
The efficacy of self-supervised speech models has been validated, yet the optimal utilization of their representations remains challenging across diverse tasks. In this study, we examine Acoustic Word Embeddings (AWEs), fixed-length features derived from continuous representations, to explore their advantages in specific tasks. AWEs have previously shown utility in capturing acoustic discriminability. In light of this, we propose measuring layer-wise similarity between AWEs and word embeddings, aiming to further investigate the inherent context within AWEs. Moreover, we evaluate the contribution of AWEs, in comparison to other types of speech features, in the context of Speech Emotion Recognition (SER). Through a comparative experiment and a layer-wise accuracy analysis on two distinct corpora, IEMOCAP and ESD, we explore differences between AWEs and raw self-supervised representations, as well as the proper utilization of AWEs alone and in combination with word embeddings. Our findings underscore the acoustic context conveyed by AWEs and show that AWEs, when employed appropriately, achieve highly competitive SER accuracies.
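
The layer-wise similarity measurement described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an AWE is obtained by mean-pooling the frame-level vectors that fall inside a word's boundaries, and it swaps in linear CKA as the similarity measure, since the abstract names neither the pooling function nor the metric. The function names (mean_pool_awe, linear_cka, layerwise_similarity) and the array shapes are hypothetical.

```python
import numpy as np


def mean_pool_awe(frames: np.ndarray, start: int, end: int) -> np.ndarray:
    """Pool the frame vectors in [start, end) into one fixed-length vector.

    Assumption: mean pooling stands in for whatever pooling the study uses.
    `frames` has shape (n_frames, dim) and comes from one model layer.
    """
    if not 0 <= start < end <= frames.shape[0]:
        raise ValueError("need 0 <= start < end <= n_frames")
    return frames[start:end].mean(axis=0)


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two embedding sets for the same n words.

    Assumption: CKA is an illustrative choice of metric, not the paper's.
    `x` has shape (n, d1) and `y` has shape (n, d2); d1 and d2 may differ.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must describe the same number of words")
    # Centre each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layerwise_similarity(awes_per_layer, word_embs: np.ndarray) -> list:
    """Score every layer's AWE matrix against one word-embedding matrix."""
    return [linear_cka(layer, word_embs) for layer in awes_per_layer]
```

Under these assumptions, each element of awes_per_layer would hold the AWEs of the same word tokens taken from one HuBERT layer, and word_embs would hold the BERT embeddings of those tokens in the same order. The layer with the highest score is the one whose AWEs align most closely with the lexical representation under this particular metric.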
