Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

Ohad Cohen, Gershon Hazan, Sharon Gannot

TL;DR

This work tackles speech emotion recognition (SER) in reverberant environments by leveraging multi-microphone inputs. It extends the HTS-AT transformer to multi-channel audio using two fusion strategies: Patch-Embed Summation and Average Mel-Spectrograms. The authors fine-tune a pre-trained HTS-AT on three datasets (RAVDESS, IEMOCAP, CREMA-D), evaluate it under reverberation from ACE RIRs, and report consistent but modest improvements over single-channel baselines. The approach supports a flexible number of microphones at low extra computational cost, which suggests it is practical for real-world reverberant settings.

Abstract

The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of SER algorithms and develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested on real-world reverberant environments.
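
The two fusion strategies named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy shapes, the non-overlapping linear patch projection, and the choice of one projection matrix per channel are all assumptions made for illustration, and the transformer that would consume the fused output is omitted.

```python
import numpy as np


def average_mel_spectrograms(mels):
    """Fuse channels before the model: (channels, time, mel_bins) -> (time, mel_bins)."""
    return mels.mean(axis=0)


def patch_embed(mel, weight, patch):
    """Cut a (time, mel_bins) spectrogram into non-overlapping patch x patch tiles
    and project each tile linearly. weight has shape (patch * patch, embed_dim).
    Returns (num_patches, embed_dim). Hypothetical stand-in for a patch-embed layer."""
    t, f = mel.shape
    tiles = mel.reshape(t // patch, patch, f // patch, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(-1, patch * patch) @ weight


def patch_embed_summation(mels, weights, patch):
    """Fuse channels after embedding: embed each channel, then sum the token sequences.
    Assumption: one projection matrix per channel (weights[c] for channel c)."""
    return sum(patch_embed(m, w, patch) for m, w in zip(mels, weights))


# Toy sizes chosen only so the example runs quickly.
rng = np.random.default_rng(0)
channels, frames, mel_bins, patch, embed_dim = 4, 8, 8, 4, 6
mels = rng.standard_normal((channels, frames, mel_bins))
weights = rng.standard_normal((channels, patch * patch, embed_dim))

avg = average_mel_spectrograms(mels)                # single spectrogram for a single-channel model
tokens = patch_embed_summation(mels, weights, patch)  # single token sequence for the transformer
print(avg.shape, tokens.shape)
```

Both functions return an array whose shape does not depend on the number of channels, which is one way to read the claim that the number of microphones can vary. Note that in this sketch, if every channel shared the same bias-free projection, linearity would make the summed embeddings equal to the channel count times the embedding of the averaged spectrogram, so the two strategies only differ meaningfully when the per-channel projections differ.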

Paper Structure (11 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Scheme of Patch-Embed Summation.
  • Figure 2: Scheme of Average Mel-Spectrograms.
  • Figure 3: Accuracy and confidence intervals on test sets convolved with the ACE RIR from Lecture Room 2 ($T_{60}=1220$ ms). Results for two HTS-AT models, fine-tuned on either the clean or the simulated-RIR datasets, are compared with 2022dalia trained on the clean datasets.