Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa
TL;DR
This work tackles personalized Speech Emotion Recognition (SER) in Human-Robot Interaction (HRI) using two vision transformer models, ViT and BEiT. It proposes a two-stage fine-tuning strategy: the models are first fine-tuned on benchmark SER datasets and then on participant-specific data, with ensembling used to capture individual differences. Experiments across five datasets and a pseudo-naturalistic HRI corpus show state-of-the-art results on RAVDESS and TESS and strong per-participant performance, highlighting the value of personalized SER in adaptive HRI. The findings have practical implications for responsive social robots and point to future extensions in multimodal and few-shot learning.
Abstract
Emotions are an essential element of verbal communication, so understanding an individual's affect during a human-robot interaction (HRI) is imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, to Speech Emotion Recognition (SER) in HRI. The focus is on generalizing the SER models to individual speech characteristics by fine-tuning them on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants. Our results show that fine-tuning vision transformers on benchmark datasets, and then either using these fine-tuned models directly or ensembling the ViT/BEiT models, yields the highest per-individual classification accuracies for identifying four primary emotions from speech (neutral, happy, sad, and angry), compared with fine-tuning vanilla ViTs or BEiTs.
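To make the ensembling idea concrete, the following minimal sketch combines the class scores of two classifiers by soft voting, that is, by averaging their softmax probabilities over the four emotions. The abstract does not specify which ensembling rule the authors use, so soft voting is an assumption here. The function names (`softmax`, `soft_vote`) and the logit values are hypothetical and chosen purely for illustration; they are not outputs of the paper's models.

```python
import numpy as np

# The four emotion classes named in the abstract.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_vote(logits_per_model, weights=None):
    """Average per-model class probabilities (soft voting).

    logits_per_model: list of arrays, each of shape (n_samples, n_classes).
    weights: optional per-model weights; equal weighting when None.
    Returns the averaged probabilities and the predicted class indices.
    """
    probs = np.stack(
        [softmax(np.asarray(l, dtype=float)) for l in logits_per_model]
    )
    avg = np.average(probs, axis=0, weights=weights)
    return avg, avg.argmax(axis=-1)


# Hypothetical logits for two utterances from two fine-tuned models
# (illustrative numbers only, standing in for ViT and BEiT outputs).
vit_logits = np.array([[2.0, 0.5, 0.1, 0.2],
                       [0.1, 0.2, 1.5, 1.4]])
beit_logits = np.array([[1.5, 1.0, 0.0, 0.1],
                        [0.0, 0.1, 0.9, 2.0]])

probs, labels = soft_vote([vit_logits, beit_logits])
predicted = [EMOTIONS[i] for i in labels]
print(predicted)
```

In this toy input the two models disagree on the second utterance (the first model slightly prefers "sad", the second strongly prefers "angry"), and the averaged probabilities resolve the disagreement in favor of the more confident model. Per-model weights could be passed through `weights` if one model is more reliable for a given participant, though whether the authors weight their models is not stated.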
