Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition
Jule Valendo Halim, Siyi Wang, Hong Jia, Ting Dang
TL;DR
The paper addresses ambiguous emotion recognition in speech using end-to-end speech foundation models (SFMs). It formalizes the task as predicting an ambiguous emotion distribution $p(\hat{y}|\bm{x}_t,\bm{\theta})$ and comparing it with ground-truth distributions from multiple annotators. It introduces two approaches for extracting emotion distributions: text-level articulation and token-level conceptualization. The token-level method takes intermediate logits, averages them across tokens, and normalizes the resulting per-emotion scores into a posterior distribution via $\phi^{e_n}=\frac{z^{e_n}}{\sum_{m=1}^N z^{e_m}}$, where $N$ is the number of emotion classes. Results show that token-level distributions capture ambiguity robustly and outperform text-based outputs, suggesting that SFMs encode intrinsic knowledge of emotional ambiguity that is useful for emotion-aware systems in HCI and mental-health contexts.
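The token-level normalization can be illustrated with a minimal sketch. It assumes the per-token emotion scores are already non-negative (for example, probabilities or exponentiated logits); the label set, the function name `token_level_distribution`, and the input layout (one row per token, one column per emotion) are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical label set, chosen only for illustration.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def token_level_distribution(per_token_scores):
    """Average scores across tokens, then normalize over emotions.

    per_token_scores: array-like of shape (num_tokens, num_emotions) holding
    non-negative scores for each emotion at each token (an assumption).
    Returns phi with phi[n] = z[n] / sum_m z[m], where z is the token average.
    """
    scores = np.asarray(per_token_scores, dtype=float)
    if scores.ndim != 2:
        raise ValueError("expected shape (num_tokens, num_emotions)")
    if np.any(scores < 0):
        raise ValueError("scores must be non-negative for this normalization")

    z = scores.mean(axis=0)   # average across tokens: one score per emotion
    total = z.sum()
    if total == 0:
        raise ValueError("all scores are zero; distribution is undefined")
    return z / total          # phi^{e_n} = z^{e_n} / sum_m z^{e_m}


# Two tokens, four emotions.
example_scores = [
    [0.2, 0.6, 0.8, 0.4],
    [0.8, 2.4, 3.2, 1.6],
]
phi = token_level_distribution(example_scores)
for label, p in zip(EMOTIONS, phi):
    print(f"{label}: {p:.2f}")
```

The resulting vector sums to 1, so it can be compared directly against a distribution built from annotator votes using any distance or divergence between discrete distributions.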
Abstract
Emotional intelligence in conversational AI is crucial across domains like human-computer interaction. While numerous models have been developed, they often overlook the complexity and ambiguity inherent in human emotions. In the era of large speech foundation models (SFMs), understanding their capability in recognizing ambiguous emotions is essential for the development of next-generation emotion-aware models. This study examines the effectiveness of SFMs in ambiguous emotion recognition. We designed prompts for ambiguous emotion prediction and introduced two novel approaches to infer ambiguous emotion distributions: one analysing generated text responses and the other examining the internal processing of SFMs through token-level logits. Our findings suggest that while SFMs may not consistently generate accurate text responses for ambiguous emotions, they can interpret such emotions at the token level based on prior knowledge, demonstrating robustness across different prompts.
