Learning Physiology-Informed Vocal Spectrotemporal Representations for Speech Emotion Recognition

Xu Zhang, Longbing Cao, Runze Yang, Zhangkai Wu

TL;DR

PhysioSER addresses the interpretability and robustness gaps in speech emotion recognition (SER) by integrating a physiology-informed vocal representation quartet $\{M,\rho,f_{\text{inst}},\tau_g\}$ (log-magnitude, log-magnitude rate, instantaneous frequency, and group delay) with a frozen self-supervised learning (SSL) backbone. The quartet is encoded by a Hamilton-structured Quaternion Spectrotemporal Encoder (QSE) that models cross-component amplitude–phase dynamics, while a Contrastive Projection and Alignment (CPA) framework aligns the result with latent SSL representations. Across 14 datasets and 10 languages, PhysioSER consistently improves over backbone-only baselines and demonstrates parameter efficiency, especially on weaker backbones and non-English data. Real-time deployment on the Ameca humanoid robot validates its practical applicability to emotion-aware human–robot interaction.
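As a rough illustration of what "Hamilton-structured" refers to, the sketch below implements the Hamilton product of two quaternions in NumPy. The function name `hamilton_product`, the `[r, i, j, k]` component ordering, and the idea of placing the four quartet features in the four quaternion components are assumptions made for illustration; the summary does not state how PhysioSER assigns features to components or how its convolution layers are parameterized.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of quaternions stored as (..., 4) arrays [r, i, j, k].

    Hypothetical helper for illustration; not taken from the PhysioSER code.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    a1, b1, c1, d1 = np.moveaxis(p, -1, 0)
    a2, b2, c2, d2 = np.moveaxis(q, -1, 0)
    return np.stack([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    ], axis=-1)

# The product is non-commutative: i * j = k, while j * i = -k.
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))
print(hamilton_product(j, i))
```

Quaternion convolutions typically apply this product between quaternion-valued weights and inputs, so every output component mixes all four input components through one shared set of four weight components instead of independent real-valued weights.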

Abstract

Speech emotion recognition (SER) is essential for humanoid robot tasks such as social robotic interactions and robotic psychological diagnosis, where interpretable and efficient models are critical for safety and performance. Existing deep models trained on large datasets remain largely uninterpretable: they often model the underlying emotional acoustic signals insufficiently and fail to capture and analyze the core physiology of emotional vocal behaviors. Physiological research on human voices shows that the dynamics of vocal amplitude and phase correlate with emotions through the vocal tract filter and the glottal source. However, most existing deep models use amplitude alone and fail to couple the physiological features of amplitude and phase or the interactions between them. Here, we propose PhysioSER, a physiology-informed vocal spectrotemporal representation learning method, to address these issues with a compact, plug-and-play design. PhysioSER constructs amplitude and phase views informed by voice anatomy and physiology (VAP) to complement self-supervised learning (SSL) models for SER. This VAP-informed framework incorporates two parallel workflows: a vocal feature representation branch that decomposes vocal signals based on VAP, embeds them into a quaternion field, and uses Hamilton-structured quaternion convolutions to model their dynamic interactions; and a latent representation branch based on a frozen SSL backbone. Utterance-level features from both workflows are then aligned by a Contrastive Projection and Alignment framework, followed by a shallow attention fusion head for SER classification. PhysioSER is shown to be interpretable and efficient for SER through extensive evaluations across 14 datasets, 10 languages, and 6 backbones, and its practical efficacy is validated by real-time deployment on a humanoid robotic platform.
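The alignment step can be pictured with a small contrastive loss. According to the paper's Figure 1 caption, the alignment framework uses projection heads with an InfoNCE objective; the sketch below is a generic NumPy version of that objective, not the authors' implementation. The function name `info_nce`, the symmetric (two-direction) form, the cosine normalization, and the default temperature are all assumptions.

```python
import numpy as np

def info_nce(z_vocal, z_latent, temperature=0.1):
    """Symmetric InfoNCE over paired utterance embeddings of shape (B, d).

    Row i of z_vocal and row i of z_latent are treated as a positive pair;
    all other rows in the batch act as negatives. Hypothetical sketch.
    """
    a = z_vocal / np.linalg.norm(z_vocal, axis=1, keepdims=True)
    b = z_latent / np.linalg.norm(z_latent, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (B, B) scaled cosine similarities

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the vocal-to-latent and latent-to-vocal directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(info_nce(z, z))        # matched pairs: low loss
print(info_nce(z, z[::-1]))  # mismatched pairs: higher loss
```

In a setup like the one described, `z_vocal` and `z_latent` would be the projected utterance-level embeddings of the two branches for the same batch of utterances.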

Paper Structure

This paper contains 25 sections, 24 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Structure of PhysioSER. The model consists of two parallel workflows: (a) an upper latent speech representation workflow that extracts general features from the raw waveform using a frozen SSL backbone, followed by a latent representation transform; and (b) a lower vocal feature representation workflow that decomposes vocal signals based on voice anatomy and physiology (VAP)-informed knowledge. In (b), the Hamilton-structured Quaternion Spectrotemporal Encoder (QSE) embeds a physiology-aligned quartet, consisting of log-magnitude ($M$), log-magnitude rate ($\rho$, spectral flux), instantaneous frequency ($f_{\mathrm{inst}}$), and group delay ($\tau_g$), and models their structured, dynamic interactions. The two branches are separately summarized into utterance-level embeddings and aligned by the Contrastive Projection and Alignment (CPA) framework via projection heads with an InfoNCE objective. Finally, a shallow Transformer encoder fuses the aligned latent and vocal representations for SER.
  • Figure 2: Hyperparameter sensitivity analysis on CREMA-D. (a) Performance for different temperatures $\eta$. (b) Performance for different CPA alignment dimensions $d_{\mathrm{align}}$. (c) Performance and complexity for different QSE depths $L$.
  • Figure 3: Visualization of test set feature clusters on CREMA-D.
  • Figure 4: Analysis of the vocal feature quartet $\{M, \rho, f_{\mathrm{inst}}, \tau_g\}$ across emotions on the CREMA-D test set.
  • Figure 5: Real-time deployment of PhysioSER on the Ameca humanoid robot. The system maps speech-driven emotional states to corresponding facial expressions.
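
To make the quartet named in the Figure 1 caption concrete, the following sketch computes one plausible version of $\{M, \rho, f_{\mathrm{inst}}, \tau_g\}$ from a short-time Fourier transform using NumPy only. It relies on textbook definitions (frame-to-frame difference of log-magnitude, phase-vocoder instantaneous frequency, and group delay from a time-weighted transform); the paper's exact formulas, window, and hop size are not given in this summary, so the function names `frame_signal` and `vocal_quartet` and all parameter values are assumptions.

```python
import numpy as np

def frame_signal(x, n_fft=512, hop=128):
    """Slice a 1-D signal (len(x) >= n_fft) into overlapping frames."""
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # (time, n_fft)

def vocal_quartet(x, sr, n_fft=512, hop=128, eps=1e-8):
    """Return an array of shape (4, freq, time): [M, rho, f_inst, tau_g].

    Hypothetical sketch with assumed definitions, not the paper's code.
    """
    x = np.asarray(x, dtype=float)
    frames = frame_signal(x, n_fft, hop) * np.hanning(n_fft)
    n = np.arange(n_fft)
    X = np.fft.rfft(frames, axis=-1).T       # (freq, time)
    Xn = np.fft.rfft(frames * n, axis=-1).T  # transform of n * x[n]

    # Log-magnitude and its frame-to-frame rate of change
    # (the first frame is padded, so its rate is zero).
    M = np.log(np.abs(X) + eps)
    rho = np.diff(M, axis=1, prepend=M[:, :1])

    # Instantaneous frequency in Hz: bin center plus the wrapped deviation
    # of the measured phase advance from the advance expected for that bin.
    # The first frame has no predecessor, so its value is only padding.
    k = np.arange(X.shape[0])[:, None]
    phase = np.angle(X)
    dphi = np.diff(phase, axis=1, prepend=phase[:, :1])
    dev = np.angle(np.exp(1j * (dphi - 2 * np.pi * k * hop / n_fft)))
    f_inst = (k / n_fft + dev / (2 * np.pi * hop)) * sr

    # Group delay in seconds via Re{DFT(n x[n]) / DFT(x[n])},
    # which avoids unwrapping the phase along frequency.
    tau_g = np.real(Xn * np.conj(X)) / (np.abs(X) ** 2 + eps) / sr

    return np.stack([M, rho, f_inst, tau_g])

sr = 16000
t = np.arange(sr) / sr
quartet = vocal_quartet(np.sin(2 * np.pi * 440.0 * t), sr)
print(quartet.shape)  # (4, n_fft // 2 + 1, number of frames)
```

Stacking the four maps along a leading axis gives each time-frequency cell four values, which is the shape a quaternion-valued encoder would need as input under the assumed one-feature-per-component layout.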