Sound-Based Recognition of Touch Gestures and Emotions for Enhanced Human-Robot Interaction

Yuanbo Hou, Qiaoqiao Ren, Wenwu Wang, Dick Botteldooren

TL;DR

The paper tackles privacy-sensitive emotion recognition and tactile interpretation in human-robot interaction by using the sounds generated during touch, rather than vision or full-body tactile sensing. It introduces MTRCNN, a lightweight, on-device, audio-only model with 0.24M parameters and 0.708 GFLOPs, designed to run within Pepper's constraints (minimum input length 1.10 s, model size 0.94 MB). Evaluated on a Pepper-based dataset with 28 participants, the approach achieves gesture accuracy of around 82–85% and arousal and valence accuracies of around 70% and 63%, respectively, with performance peaking at input lengths of 6–7 seconds. Compared with larger pretrained audio models, MTRCNN delivers competitive gesture recognition at a much lower computational cost, enabling real-time, privacy-preserving HRI. This work demonstrates a viable pathway for deploying audio-only touch gesture and emotion recognition in social robots.

Abstract

Emotion recognition and touch gesture decoding are crucial for advancing human-robot interaction (HRI), especially in social environments where emotional cues and tactile perception play important roles. However, many humanoid robots, such as Pepper, Nao, and Furhat, lack full-body tactile skin, limiting their ability to engage in touch-based emotional and gesture interactions. In addition, vision-based emotion recognition methods usually face strict GDPR compliance challenges due to the need to collect personal facial data. To address these limitations and avoid privacy issues, this paper studies the potential of using the sounds produced by touching during HRI to recognise tactile gestures and classify emotions along the arousal and valence dimensions. Using a dataset of tactile gestures and emotional interactions from 28 participants with the humanoid robot Pepper, we design an audio-only lightweight touch gesture and emotion recognition model with only 0.24M parameters, a model size of 0.94 MB, and 0.7 GFLOPs. Experimental results show that the proposed sound-based model effectively recognises the arousal and valence states of different emotions, as well as various tactile gestures, across a range of input audio lengths. The proposed model is low-latency and achieves results similar to those of well-known pretrained audio neural networks (PANNs), but with far fewer FLOPs and parameters and a much smaller model size.
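
As a rough consistency check on the figures quoted in the abstract, the sketch below estimates the storage needed for 0.24M parameters. It assumes the weights are stored as 32-bit floats and that "MB" means 10^6 bytes; neither is stated in the source, and the helper name `float32_model_size_mb` is hypothetical. The parameter count is itself rounded, so only approximate agreement should be expected.

```python
def float32_model_size_mb(num_params: int) -> float:
    """Approximate storage for num_params weights, in MB (10**6 bytes)."""
    bytes_per_param = 4  # assumption: float32 weights, not confirmed by the source
    return num_params * bytes_per_param / 1e6


reported_params = 240_000  # 0.24M parameters, as quoted in the abstract (rounded)
reported_size_mb = 0.94    # model size quoted in the abstract

estimated_size_mb = float32_model_size_mb(reported_params)
print(f"estimated: {estimated_size_mb:.2f} MB, reported: {reported_size_mb:.2f} MB")
```

Under these assumptions the estimate comes to about 0.96 MB, within rounding of the reported 0.94 MB, which suggests the quoted size is dominated by the raw weights.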

Paper Structure (9 sections, 2 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: The robot Pepper's physical information.
  • Figure 2: The participant interacts with the robot Pepper.
  • Figure 3: Circumplex model (Russell, 1980) with the 10 emotions used in this paper.
  • Figure 4: The proposed lightweight multi-temporal resolution convolutional neural network (MTRCNN).
  • Figure 5: Normalized confusion matrix on the test set.