Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers

Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa

TL;DR

This work tackles personalized Speech Emotion Recognition (SER) in Human-Robot Interaction (HRI) by leveraging Vision Transformers (ViT) and BEiT. It proposes a two-stage fine-tuning strategy (first on benchmark SER datasets, then on participant-specific data) combined with ensembling to capture individual differences. Experiments across five datasets and a pseudo-naturalistic HRI corpus show state-of-the-art results on RAVDESS and TESS and strong per-participant performance, highlighting the value of personalized SER in adaptive HRI. The findings have practical implications for responsive social robots and point to future multimodal and few-shot learning extensions.

Abstract

Emotions are an essential element in verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT and BEiT-based models and tested these models on unseen speech samples from the participants. In the results, we show that fine-tuning vision transformers on benchmark datasets and then using either these already fine-tuned models or ensembles of ViT/BEiT models yields the highest per-individual classification accuracies in identifying four primary emotions from speech (neutral, happy, sad, and angry), compared to fine-tuning vanilla ViTs or BEiTs.
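
The abstract mentions ensembling ViT and BEiT models but does not say how their outputs are combined. As a rough illustration only, the sketch below assumes soft voting, i.e. averaging the class probabilities of the two models over the four emotions. The function names, the equal weights, and the logit values are hypothetical and are not taken from the paper.

```python
import math

# The four emotion classes named in the abstract.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logit_sets, weights=None):
    """Soft voting (an assumption, not the paper's confirmed method):
    average the per-model class probabilities and pick the top class."""
    if weights is None:
        weights = [1.0 / len(logit_sets)] * len(logit_sets)
    probs = [softmax(logits) for logits in logit_sets]
    averaged = [
        sum(w * p[i] for w, p in zip(weights, probs))
        for i in range(len(EMOTIONS))
    ]
    best = max(range(len(EMOTIONS)), key=averaged.__getitem__)
    return EMOTIONS[best], averaged


# Hypothetical logits for one utterance from two fine-tuned models.
vit_logits = [0.2, 2.1, 0.1, 1.9]   # ViT alone leans toward "happy"
beit_logits = [0.1, 1.0, 0.3, 2.5]  # BEiT alone leans toward "angry"

label, probs = ensemble_predict([vit_logits, beit_logits])
print(label, [round(p, 3) for p in probs])
```

With these made-up scores the two models disagree, and the averaged probabilities favor the class that BEiT predicts with higher confidence. A weighted variant (passing `weights`) would let one model dominate for a given participant.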

Paper Structure

This paper contains 11 sections, 7 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: The two pipelines evaluated in this paper for speech emotion recognition.
  • Figure 2: t-SNE plots of the ViT and BEiT embeddings for each emotion, showing the feature space of the emotional representations across all benchmark datasets and the collected participant data.