On the Emotion Understanding of Synthesized Speech

Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao

Abstract

Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, which would make their outputs a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and that paralinguistic understanding in SLMs remains challenging.

Paper Structure (38 sections, 4 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Speech emotion recognition accuracy on the TESS and CREMA-D datasets. Synthetic speech results represent agreement with human judgments, as audio samples with indistinct emotional expressions were manually excluded. SER results obtained with Emotion2vec highlight a clear gap between human and synthesized speech.
  • Figure 2: Confusion matrices for speech emotion recognition. The vertical axis represents the ground truth, and the horizontal axis represents the model's predictions. Sub-figures (a) and (b) show SER results on human speech from TESS and CREMA-D. Sub-figures (c) and (d) show SER results for speech synthesized by two TTS models from LLM-generated text. Sub-figures (e) and (f) show the same results as (c) and (d) after filtering out samples with weak emotional expression. Sub-figures (g) and (h) show the same as (e) and (f), but with speech synthesized by S2S LLMs.
  • Figure 3: Speech emotion recognition performance on identical text datasets, investigated to mitigate the impact of the distribution of LLM-generated text.
  • Figure 4: Confusion matrices of speech emotion recognition results for synthetic speech based on TESS text. The vertical axis represents the ground truth, while the horizontal axis represents the model’s predictions. Sub-figures (a) and (b) show SER results for two TTS models, while (c) and (d) show SER results for two S2S models.
  • Figure 5: Speech emotion recognition accuracy of SLMs. Solid colors and shaded areas represent the SER results of Qwen3-Omni and GPT-4o Audio, respectively.
  • ...and 8 more figures
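
Several of the captions above report SER accuracy and confusion matrices with ground truth on the vertical axis and predictions on the horizontal axis. As a reading aid, here is a minimal sketch of how such quantities are commonly computed. The label set, the helper names `confusion_matrix` and `accuracy`, and the toy labels are all assumptions made for illustration; they are not taken from the paper and do not reproduce its evaluation pipeline or results.

```python
from collections import Counter

# Hypothetical label set; the actual emotion classes depend on the dataset.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def confusion_matrix(truth, predicted, labels=EMOTIONS):
    """Rows index the ground-truth label, columns the predicted label."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(truth, predicted):
    """Fraction of utterances whose predicted emotion matches the ground truth."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("empty input")
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


# Toy, made-up labels for illustration only (not results from the paper).
truth = ["angry", "happy", "happy", "sad", "neutral", "sad"]
predicted = ["angry", "neutral", "happy", "sad", "neutral", "neutral"]

matrix = confusion_matrix(truth, predicted)
acc = accuracy(truth, predicted)
for label, row in zip(EMOTIONS, matrix):
    print(f"{label:>8}: {row}")
print(f"accuracy = {acc:.3f}")
```

In this layout, off-diagonal mass in a single column (here, the "neutral" column) indicates that a recognizer is collapsing several true emotions onto one prediction, which is the kind of pattern a confusion matrix makes visible where a single accuracy number does not.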