Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition

Yong Wang; Cheng Lu; Hailun Lian; Yan Zhao; Björn Schuller; Yuan Zong; Wenming Zheng

TL;DR

The paper addresses speech emotion recognition (SER) with a hierarchical Swin-Transformer tailored to speech, which captures multi-scale emotional cues across time-frequency patches. The proposed Speech Swin-Transformer comprises four stages, each applying a local window Transformer and a shifted window Transformer followed by patch merging, so that the receptive field expands progressively and patches interact across window boundaries; the input is a log-Mel spectrogram $x \in \mathbb{R}^{b \times c \times f \times d}$. Under Leave-One-Speaker-Out validation, the model reports WAR/UAR of $75.22\%/65.94\%$ on IEMOCAP and $54.33\%/54.33\%$ on CASIA, which the authors present as state-of-the-art, and its hierarchical feature maps show stage-wise specialization to emotion-related spectro-temporal patterns. The work highlights the value of integrating local and global patch correlations in SER and points to boundary modeling in both the time and frequency domains as future work for improving discrimination across emotions.
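The two reported metrics can be illustrated with a short sketch. It assumes the usual SER definitions, which the summary itself does not spell out: WAR (weighted average recall) as overall accuracy and UAR (unweighted average recall) as the mean of per-class recalls. The function name `war_uar` and the toy labels are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) as fractions in [0, 1].

    Assumed definitions (not stated in the source):
      WAR = correctly classified samples / all samples
      UAR = mean over classes of (correct in class / samples in class)
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    war = sum(correct.values()) / len(y_true)
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy, imbalanced example: four "ang" utterances and two "sad" utterances.
y_true = ["ang", "ang", "ang", "ang", "sad", "sad"]
y_pred = ["ang", "ang", "ang", "sad", "sad", "ang"]
war, uar = war_uar(y_true, y_pred)
```

On imbalanced data the two values differ because WAR is dominated by the larger classes, while UAR gives every emotion equal weight. When all classes contain the same number of samples the two coincide.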

Abstract

Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation built on the Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this inspiration, this paper presents a hierarchical speech Transformer with shifted windows, called Speech Swin-Transformer, that aggregates multi-scale emotion features for speech emotion recognition (SER). Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, each composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer explores local inter-frame emotional information across the frame patches of each segment patch. We also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of the Transformer from frame level to segment level. Experimental results demonstrate that our proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
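The three operations described in the abstract (local windows, shifted windows, patch merging) can be sketched on a toy sequence of frame patches along the time axis. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention computation is omitted, the cyclic shift is borrowed from the original vision Swin-Transformer, and the window size, shift, and helper names (`window_partition`, `cyclic_shift`, `patch_merge`) are hypothetical.

```python
def window_partition(patches, window):
    """Split a list of frame patches into non-overlapping windows of equal size."""
    if window <= 0 or len(patches) % window:
        raise ValueError("number of patches must be a multiple of the window size")
    return [patches[i:i + window] for i in range(0, len(patches), window)]


def cyclic_shift(patches, shift):
    """Rotate the patch sequence left by `shift` positions (assumed Swin-style shift)."""
    shift %= len(patches)
    return patches[shift:] + patches[:shift]


def patch_merge(patches):
    """Concatenate each pair of neighbouring patch vectors.

    The result has half as many patches, each twice as wide. A learned linear
    projection would normally follow; it is left out of this sketch.
    """
    if len(patches) % 2:
        raise ValueError("number of patches must be even")
    return [patches[i] + patches[i + 1] for i in range(0, len(patches), 2)]


# Eight frame patches with one-dimensional features 0.0 .. 7.0 (toy data).
frames = [[float(t)] for t in range(8)]

# Local windows: attention would run inside each group of 4 frames.
local = window_partition(frames, 4)

# Shifted windows: shifting by half a window makes the first window
# straddle the old boundary between frames 3 and 4.
shifted = window_partition(cyclic_shift(frames, 2), 4)

# Patch merging: 8 patches of width 1 become 4 patches of width 2.
merged = patch_merge(frames)
```

In this toy case the first shifted window holds frames 2 to 5, so frames on either side of the original window boundary can interact. The second shifted window wraps around (frames 6, 7, 0, 1); the original vision Swin-Transformer masks attention between such non-adjacent patches, and whether the speech variant does the same is not stated in this summary.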

Paper Structure (13 sections, 10 equations, 3 figures, 1 table)

Introduction
Proposed Method
Local windows Transformer
Shifted windows Transformer
Patch Merging Module
Experiments
Experimental Databases
Experimental Protocol
Experimental Setup
Results and Analysis
Hierarchical Feature map visualizations
Conclusion
Acknowledgements

Figures (3)

Figure 1: Overall architecture of Speech Swin-Transformer for speech emotion recognition (SER).
Figure 2: Confusion matrices of Speech Swin-Transformer.
Figure 3: Visualizations of hierarchical feature maps generated by the different stages of Speech Swin-Transformer.
