Touch Speaks, Sound Feels: A Multimodal Approach to Affective and Social Touch from Robots to Humans

Qiaoqiao Ren, Tony Belpaeme

TL;DR

This study investigates affective and social touch from robots to humans by integrating a vibrotactile sleeve with contact-generated audio to form audiotactile feedback. A within-subject experiment with $32$ Chinese participants demonstrates that multimodal (haptic+audio) stimuli yield higher emotion decoding accuracy ($44.1\%$ across $10$ emotions) than either modality alone ($25\%$ for touch, $31.6\%$ for sound), and gesture decoding generally benefits from multimodal cues. The VibroSleeve features a $5\times5$ motor grid controlled via PWM and is paired with noise-cancelling headphones. The stimulus dataset comprises $84$ audio clips and $84$ tactile sequences derived from $28$ participants, with stimuli selected by $k$-means and translated into vibrotactile patterns. The results underscore the complementary roles of haptic and auditory cues in affective human–robot interaction and offer design guidance for socially expressive robots that leverage multisensory feedback to enhance emotional communication.

Abstract

Affective tactile interaction constitutes a fundamental component of human communication. In natural human-human encounters, touch is seldom experienced in isolation; rather, it is inherently multisensory. Individuals not only perceive the physical sensation of touch but also register the accompanying auditory cues generated through contact. The integration of haptic and auditory information forms a rich and nuanced channel for emotional expression. While extensive research has examined how robots convey emotions through facial expressions and speech, their capacity to communicate social gestures and emotions via touch remains largely underexplored. To address this gap, we developed a multimodal interaction system incorporating a 5×5 grid of 25 vibration motors synchronized with audio playback, enabling robots to deliver combined haptic-audio stimuli. In an experiment involving 32 Chinese participants, ten emotions and six social gestures were presented through vibration, sound, or their combination. Participants rated each stimulus on arousal and valence scales. The results revealed that (1) the combined haptic-audio modality significantly enhanced decoding accuracy compared to single modalities; (2) each individual channel (vibration or sound) effectively supported the recognition of certain emotions, with distinct advantages depending on the emotional expression; and (3) gestures alone were generally insufficient for conveying clearly distinguishable emotions. These findings underscore the importance of multisensory integration in affective human-robot interaction and highlight the complementary roles of haptic and auditory cues in enhancing emotional communication.

Paper Structure

This paper contains 23 sections, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Emotion distribution based on Russell’s circumplex model. Emotions are positioned on Russell’s circumplex model according to their valence and arousal levels. The intention "grab attention" is not an emotion and is therefore placed at the origin (Quadrant 0), representing its status as a communicative act outside the affective dimensions.
  • Figure 2: Decoding performance across different modalities. The figure compares unimodal (tactile or auditory) and multimodal (tactile + auditory) approaches, showing that multimodal integration consistently outperforms single-modality decoding.
  • Figure 3: Mediated touch from the robot.
  • Figure 4: Vibration sleeves.
  • Figure 5: Top: Waveform of "Anger" audio, showing the raw time-domain signal. Bottom: Original amplitude envelope of the same excerpt, illustrating the smoothed contour of overall loudness variations across time.
  • ...and 11 more figures
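
The amplitude envelope in Figure 5 and the PWM-driven motor grid mentioned in the TL;DR suggest one way an audio clip could be turned into vibration intensities. The following is a minimal sketch of that idea, not the authors' pipeline: the RMS windowing, the window length, the 8-bit duty range, and the function names `amplitude_envelope` and `envelope_to_duty` are all assumptions made for illustration, and a synthetic tone stands in for a real recording.

```python
import math


def amplitude_envelope(samples, window):
    """Return one RMS value per non-overlapping window of `samples`.

    Assumption: RMS over fixed windows stands in for the smoothed
    loudness contour; the paper's exact smoothing is not given here.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    envelope = []
    for start in range(0, len(samples), window):
        frame = samples[start:start + window]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        envelope.append(rms)
    return envelope


def envelope_to_duty(envelope, max_duty=255):
    """Scale envelope values linearly to integer PWM duty cycles.

    Assumption: an 8-bit duty range (0..255), normalised to the
    clip's own peak. A silent clip maps to all zeros.
    """
    peak = max(envelope, default=0.0)
    if peak == 0.0:
        return [0] * len(envelope)
    return [round(max_duty * value / peak) for value in envelope]


# Synthetic stand-in for an audio clip: a 440 Hz tone whose loudness
# ramps up linearly over one second (hypothetical data).
sample_rate = 8000
tone = [
    (n / sample_rate) * math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate)
]

env = amplitude_envelope(tone, window=800)  # 0.1 s windows
duty = envelope_to_duty(env)
print(duty)
```

Each value in `duty` could then be written to a motor's PWM channel for the duration of one window. How intensities are distributed across the 25 motors is a separate design decision that this sketch does not address.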