Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics

Ziqian Zhang, Min Huang, Zhongzhe Xiao

TL;DR

This work investigates the role of speech production physiology in speech emotion recognition (SER) by introducing the STEM-E2VA dataset, which provides audio, electroglottography (EGG) for phonation excitation, and electromagnetic articulography (EMA) for articulatory kinematics. It demonstrates that fusing phonation excitation and articulatory movement information with speech improves 7-class SER, and it explores the practicality of using estimated physiological data derived from speech via IAIF and acoustic-to-articulatory inversion (AAI) with HuBERT features and a Temporal Convolutional Network. Ground-truth data show substantial gains from multimodal fusion (up to ~88.4% accuracy), while estimated data achieve modest gains (tri-modal ~82.7%), indicating feasibility with room for improvement in inversion and fusion techniques. Overall, the study highlights the complementary value of physiological information for SER and provides a dataset and methodology for advancing physiology-informed multimodal SER toward real-world application.

Abstract

Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, and textual information further enhances its performance. However, few studies have focused on the physiological information generated during speech production, which also carries speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of phonation excitation information and articulatory kinematics for SER. Because training data for this purpose are scarce, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio and physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA). EGG and EMA provide information on phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using physiological data estimated from speech through inversion methods, instead of collected EGG and EMA, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production for SER and demonstrate its potential for practical use in real-world scenarios.

Paper Structure

This paper contains 12 sections, 8 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Configuration of sensors in the EMA system.
  • Figure 2: Confusion matrix (in %) for (a) the uni-modal scenario using speech and (b) the best-performing tri-modal scenario.