Acoustic and Semantic Modeling of Emotion in Spoken Language

Soumya Dutta

TL;DR

The thesis demonstrates improved emotion transfer, shows that style-transferred speech can serve as data augmentation to improve emotion recognition, and introduces a speech-driven supervised pre-training framework that enables large-scale emotion-aware text modeling without manually annotated text corpora.

Abstract

Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced to enable large-scale emotion-aware text modeling without requiring manually annotated text corpora. The second part addresses emotion recognition in conversational settings. Hierarchical architectures combining cross-modal attention and mixture-of-experts fusion are developed to integrate acoustic and semantic information across conversational turns. Finally, the thesis introduces a textless and non-parallel speech-to-speech framework for emotion style transfer that enables controllable emotional transformations while preserving speaker identity and linguistic content. The results demonstrate improved emotion transfer and show that style-transferred speech can be used for data augmentation to improve emotion recognition.

Paper Structure (127 sections, 28 equations, 33 figures, 24 tables)

Figures (33)

  • Figure 1: Illustration of a limitation in emotion understanding from spoken language by a multimodal conversational model. Although the utterance is perceived as conveying fear by six human annotators with $100\%$ agreement, Gemini $3$ (with thinking enabled) predicts a happy emotion and provides an incorrect explanation.
  • Figure 2: Summary of the thesis contributions
  • Figure 3: Road map for the thesis chapters
  • Figure 4: Block diagram of the proposed CARE model. The acoustic encoder of the model is trained with PASE+ features as targets. Blocks in blue indicate either frozen components or those with no learnable parameters. For the semantic encoder, the transformer layers are frozen while the convolutional adapters are trained. Because the output of the acoustic encoder has dimension $768$, an FC layer is attached to match the PASE+ feature dimension of $256$. This FC layer and the average pool block after the semantic encoder are not used during inference.
  • Figure 5: Performance of CARE when different acoustic targets are used. The model with eGeMAPS features as targets is trained similarly to the PASE+ baseline. All numbers are averages over $5$ random initializations.
  • ...and 28 more figures
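
The Figure 4 caption describes an FC layer that maps the $768$-dimensional acoustic encoder output to the $256$-dimensional PASE+ targets and is dropped at inference. A minimal sketch of that training-only projection pattern follows. It is not the thesis implementation: the class and function names are hypothetical, the weights and inputs are random placeholders, and the mean-squared-error loss is an assumption, since the caption does not name the loss.

```python
import random

ENCODER_DIM = 768   # acoustic encoder output size (from the Figure 4 caption)
TARGET_DIM = 256    # PASE+ feature size (from the Figure 4 caption)


class ProjectionHead:
    """Hypothetical FC layer mapping one encoder frame to the target dimension."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = in_dim ** -0.5
        # Random placeholder weights; a real model would learn these.
        self.weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, frame):
        return [sum(w * x for w, x in zip(row, frame)) + b
                for row, b in zip(self.weight, self.bias)]


def mse(pred, target):
    """Assumed regression loss; the caption does not specify the loss used."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(encoder_frame, target_frame, head):
    # Training path: project 768 -> 256, then compare with the target features.
    return mse(head(encoder_frame), target_frame)


def inference_features(encoder_frame):
    # Inference path: the head is dropped and the 768-dim output is used as is.
    return encoder_frame


# Placeholder data standing in for one encoder frame and one target frame.
rng = random.Random(1)
frame = [rng.gauss(0.0, 1.0) for _ in range(ENCODER_DIM)]
target = [rng.gauss(0.0, 1.0) for _ in range(TARGET_DIM)]
head = ProjectionHead(ENCODER_DIM, TARGET_DIM)
loss = training_loss(frame, target, head)
```

Keeping the projection in a separate head means the downstream representation stays at the encoder's native $768$ dimensions, and the target-matching layer can be discarded once pre-training ends.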