Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture
Ohad Cohen, Gershon Hazan, Sharon Gannot
TL;DR
This work tackles speech emotion recognition (SER) in reverberant environments by leveraging multi-microphone inputs. It extends the HTS-AT transformer to multi-channel audio using two fusion strategies: Patch-Embed Summation and Average Mel-Spectrograms. The authors fine-tune a pre-trained HTS-AT on three datasets (RAVDESS, IEMOCAP, CREMA-D) reverberated with ACE room impulse responses (RIRs) and report consistent but modest improvements over single-channel baselines. The approach supports a variable number of microphones at low additional computational cost, suggesting it is practical for real-world reverberant settings.
Abstract
The performance of most speech emotion recognition (SER) systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate this degradation and to develop a system that is more robust in adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adapt a state-of-the-art transformer model, the Hierarchical Token-semantic Audio Transformer (HTS-AT), to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model outperforms single-channel baselines when tested in real-world reverberant environments.
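To make the two fusion strategies concrete, the following is a minimal NumPy sketch, not the authors' implementation. It models the patch-embedding layer as a linear projection of non-overlapping square patches, which is a simplification. The function names, the per-channel weight lists, and the array shapes are all assumptions introduced for illustration; the text above does not state whether the embedding weights are shared across microphones.

```python
import numpy as np

def patchify(mel, patch):
    """Split a (T, F) mel-spectrogram into flattened patch x patch tiles."""
    t, f = mel.shape
    t, f = t - t % patch, f - f % patch  # drop ragged edges, if any
    tiles = mel[:t, :f].reshape(t // patch, patch, f // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def patch_embed(mel, weight, bias, patch):
    """Project each flattened patch to a token (hypothetical linear embed)."""
    return patchify(mel, patch) @ weight + bias

def average_mel_fusion(mels, weight, bias, patch):
    """Average the mel-spectrograms over microphones, then embed once."""
    return patch_embed(mels.mean(axis=0), weight, bias, patch)

def patch_embed_summation(mels, weights, biases, patch):
    """Embed each microphone's mel-spectrogram, then sum the token sequences."""
    return sum(patch_embed(m, w, b, patch)
               for m, w, b in zip(mels, weights, biases))

# Usage: 3 microphones, 8 frames x 8 mel bins, 4x4 patches, 5-dim tokens.
rng = np.random.default_rng(0)
mels = rng.standard_normal((3, 8, 8))
weights = [rng.standard_normal((16, 5)) for _ in range(3)]
biases = [np.zeros(5) for _ in range(3)]
tokens_avg = average_mel_fusion(mels, weights[0], biases[0], patch=4)
tokens_sum = patch_embed_summation(mels, weights, biases, patch=4)
```

Both functions return one token sequence of the same shape regardless of the number of microphones, so in this sketch the rest of the transformer would not need to change. Note that if a single bias-free linear embedding were shared across channels, summation would reduce to a scaled version of averaging; the sketch therefore takes separate per-channel weights.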
