Token-Level Logits Matter: A Closer Look at Speech Foundation Models for Ambiguous Emotion Recognition

Jule Valendo Halim, Siyi Wang, Hong Jia, Ting Dang

TL;DR

The paper addresses ambiguous emotion recognition in speech using end-to-end speech foundation models (SFMs). It formalizes the task as predicting an emotion distribution $p(\hat{y}|\bm{x}_t,\bm{\theta})$ and compares that prediction with ground-truth distributions derived from multiple annotators. It introduces two approaches for extracting emotion distributions: text-level articulation and token-level conceptualization. The token-level method takes intermediate logits, averages them across tokens, and normalizes the result into a posterior distribution via $\phi^{e_n}=\frac{z^{e_n}}{\sum_{n'=1}^N z^{e_{n'}}}$. Results show that token-level distributions capture ambiguity robustly across prompts and outperform text-based outputs, suggesting SFMs encode intrinsic ambiguity knowledge useful for emotion-aware systems in HCI and mental-health contexts.
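
The token-level normalization in the TL;DR can be illustrated with a minimal sketch. It assumes that each emotion $e_n$ has a list of non-negative per-token scores (for example, probabilities obtained from the logits) and that $z^{e_n}$ is their mean; the summary does not say how the scores are made non-negative, so that part is an assumption. The function name `token_level_distribution` and the input layout are hypothetical, not taken from the paper.

```python
import numpy as np


def token_level_distribution(token_scores):
    """Turn per-token emotion scores into a distribution over emotions.

    token_scores: dict mapping an emotion label to a list of non-negative
    per-token scores (assumed layout, for illustration only).
    """
    emotions = list(token_scores)
    # z^{e_n}: average the scores of each emotion across its tokens.
    z = np.array([np.mean(token_scores[e]) for e in emotions], dtype=float)
    if np.any(z < 0):
        raise ValueError("scores must be non-negative to normalize by their sum")
    total = z.sum()
    if total == 0:
        raise ValueError("all averaged scores are zero; distribution undefined")
    # phi^{e_n} = z^{e_n} / sum over all emotions of z.
    phi = z / total
    return dict(zip(emotions, phi.tolist()))


example = token_level_distribution(
    {"happy": [0.6, 0.2], "sad": [0.1, 0.1], "neutral": [0.3, 0.3]}
)
```

Here the averaged scores are 0.4, 0.1 and 0.3, so dividing by their sum (0.8) gives a distribution that sums to one while preserving the relative weight of each emotion.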

Abstract

Emotional intelligence in conversational AI is crucial across domains like human-computer interaction. While numerous models have been developed, they often overlook the complexity and ambiguity inherent in human emotions. In the era of large speech foundation models (SFMs), understanding their capability in recognizing ambiguous emotions is essential for the development of next-generation emotion-aware models. This study examines the effectiveness of SFMs in ambiguous emotion recognition. We designed prompts for ambiguous emotion prediction and introduced two novel approaches to infer ambiguous emotion distributions: one analysing generated text responses and the other examining the internal processing of SFMs through token-level logits. Our findings suggest that while SFMs may not consistently generate accurate text responses for ambiguous emotions, they can interpret such emotions at the token level based on prior knowledge, demonstrating robustness across different prompts.

Paper Structure

This paper contains 18 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: System overview. Speech utterances are processed by SFMs to generate emotion distributions, which are then compared with the ground truth inferred from $M$ human annotators.
  • Figure 2: Framework for ambiguous emotion recognition using SFMs. Providing a prompt alongside the speech input enables extraction of i) the generated text at the text level and ii) posterior probabilities at the token level.
  • Figure 3: Performance comparison utilizing logits of i) both emotion-related text and numerical percentage and ii) only emotion-related text.
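
The ground truth in Figure 1 is described only as being inferred from $M$ human annotators. One common way to picture it, offered here as an assumption rather than the paper's stated procedure, is the fraction of annotators who chose each emotion. The function name `annotator_distribution` and the label set are hypothetical.

```python
from collections import Counter


def annotator_distribution(labels, emotions):
    """Fraction of annotators choosing each emotion (assumed definition).

    labels: one emotion label per annotator, so M = len(labels).
    emotions: the full label set, so unchosen emotions get probability 0.
    """
    m = len(labels)
    if m == 0:
        raise ValueError("at least one annotation is required")
    unknown = set(labels) - set(emotions)
    if unknown:
        raise ValueError(f"labels outside the emotion set: {sorted(unknown)}")
    counts = Counter(labels)
    return {e: counts.get(e, 0) / m for e in emotions}


truth = annotator_distribution(
    ["happy", "happy", "neutral", "sad"],
    ["happy", "sad", "neutral", "angry"],
)
```

With four annotators, two votes for "happy" give it probability 0.5, while an emotion nobody chose stays at 0, which keeps the ambiguity among annotators visible instead of collapsing it to a single majority label.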