Temporal-Frequency State Space Duality: An Efficient Paradigm for Speech Emotion Recognition

Jiaqi Zhao, Fei Wang, Kun Li, Yanyan Wei, Shengeng Tang, Shu Zhao, Xiao Sun

TL;DR

Speech emotion recognition (SER) has to handle highly variable temporal dynamics while also exploiting frequency envelope structures, which existing methods tend to overlook. The paper introduces TF-Mamba, built on a Bi-Domain state space duality (SSD) block that jointly models temporal and frequency cues through a Temporal-Aware Module and a Frequency Filter Module. A Complex Metric-Distance Triplet (CMDT) loss tightens intra-class clustering and increases inter-class separation in the complex frequency domain. Experiments on IEMOCAP and MELD show state-of-the-art performance with fewer parameters and lower latency. The approach targets interactive systems and other applications that need robust, low-latency speech emotion recognition.

Abstract

Speech Emotion Recognition (SER) plays a critical role in enhancing user experience within human-computer interaction. However, existing methods focus predominantly on temporal-domain analysis and overlook the valuable envelope structures of the frequency domain, which are equally important for robust emotion recognition. To overcome this limitation, we propose TF-Mamba, a novel multi-domain framework that captures emotional expressions in both the temporal and frequency dimensions. Concretely, we propose a temporal-frequency Mamba block that extracts temporal- and frequency-aware emotional features, balancing computational efficiency and model expressiveness. In addition, we design a Complex Metric-Distance Triplet (CMDT) loss that enables the model to capture representative emotional cues for SER. Extensive experiments on the IEMOCAP and MELD datasets show that TF-Mamba surpasses existing methods in terms of model size and latency, providing a more practical solution for future SER applications.
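
The abstract describes the CMDT loss only at a high level: a triplet-style objective evaluated in the complex frequency domain. As a rough illustration of that idea, and not the paper's actual formulation, the sketch below maps three feature sequences to complex spectra and applies a standard triplet margin loss to their distances. The function name, the use of a Euclidean distance over complex entries, and the margin value are all assumptions made for this example.

```python
import numpy as np

def complex_triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss on the complex spectra of three feature sequences.

    Each input is a real array of shape (time, channels). The name, the
    distance, and the margin are illustrative assumptions, not the paper's
    CMDT definition.
    """
    # Map each sequence to the complex frequency domain along the time axis.
    fa, fp, fn = (np.fft.rfft(x, axis=0) for x in (anchor, positive, negative))
    # Euclidean distance over complex entries (np.abs gives the modulus).
    d_pos = np.sqrt(np.sum(np.abs(fa - fp) ** 2))
    d_neg = np.sqrt(np.sum(np.abs(fa - fn) ** 2))
    # Hinge: zero loss once the negative is at least `margin` farther away
    # from the anchor than the positive.
    return max(d_pos - d_neg + margin, 0.0)
```

With this form, a same-emotion pair that is already close and a different-emotion sample that is far away contribute no loss, which matches the stated goal of tighter intra-class clusters and wider inter-class separation.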

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visualization of time-aligned speech intensity, pitch, and frequency spectrum for a real sample from the IEMOCAP dataset (Busso et al., 2008). The frequency spectrum reveals finer detail in tone, timbre, and other envelope structures.
  • Figure 2: The proposed TF-Mamba framework, a multi-domain learning paradigm that captures speech emotion expressions in both the temporal and frequency domains. TF-Mamba balances computational efficiency and model performance by combining an efficient Bi-Domain SSD mechanism with the Temporal-Aware Module and the Frequency Filter Module. In addition, the CMDT loss tightens the clustering of same-emotion samples and increases the separation between emotions, improving emotional discrimination and model robustness.
  • Figure 3: TF-Mamba achieves state-of-the-art (SOTA) performance on the SER task while remaining computationally efficient.
  • Figure 4: The top displays the feature token intensity comparison before and after the Temporal-Aware Module, while the bottom shows the spectrum comparison before and after the Frequency Filter Module.
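
Figure 4 compares spectra before and after the Frequency Filter Module, but this page does not specify how the module works. A common way to filter features in the frequency domain, shown here purely as an assumed illustration, is to transform along the time axis, multiply the spectrum by a per-bin mask, and transform back. The function name and the mask (which would typically be learned) are hypothetical.

```python
import numpy as np

def frequency_filter(x, mask):
    """Filter a real sequence in the frequency domain.

    x:    real array of shape (time, channels).
    mask: array of shape (time // 2 + 1, channels) that scales each
          frequency bin; in a trained model this would be a learned
          parameter (an assumption, not confirmed by the source).
    """
    # Real FFT along the time axis gives time // 2 + 1 frequency bins.
    spec = np.fft.rfft(x, axis=0)
    # Element-wise scaling attenuates or emphasizes individual bins.
    filtered = spec * mask
    # Inverse transform back to a real sequence of the original length.
    return np.fft.irfft(filtered, n=x.shape[0], axis=0)
```

An all-ones mask leaves the input unchanged, and a mask that keeps only the first bin reduces each channel to its mean over time, which gives two simple sanity checks for the sketch.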