Self context-aware emotion perception on human-robot interaction

Zihan Lin, Francisco Cruz, Eduardo Benitez Sandoval

TL;DR

The paper addresses the challenge of continuous, context-informed emotion perception in long-term human-robot interaction by introducing SCAM, a self context-aware model that anchors emotions in a two-dimensional valence–arousal space and jointly leverages prior context with current observations. The method combines a per-segment multi-task network (emotion, valence, arousal) with a self context-aware structure that propagates information across segments and enforces a context-consistent learning signal via a dedicated context loss. Key contributions include segment relabeling with contextual anchoring, a novel context propagation mechanism, and a cosine-based context loss that captures emotional change trends, yielding state-of-the-art or competitive performance in the audio, visual, and multimodal settings on IEMOCAP. The results show gains in both emotion recognition and dimensional regression, highlighting the practical potential for adaptive, context-aware HRI systems; future work targets robot-based validation and richer multimodal fusion strategies.
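To make the idea of a cosine-based context loss more concrete, here is a minimal sketch of one plausible reading. This summary does not give the paper's exact formulation, so everything below is an assumption: the function name `context_loss`, the choice of change vectors between consecutive segments in valence–arousal space, and the `1 - cosine` penalty are illustrative, not the authors' definition.

```python
import math

def context_loss(pred_va, true_va, eps=1e-8):
    """Hypothetical cosine-based context loss (an assumption, not the paper's formula).

    pred_va, true_va: sequences of (valence, arousal) pairs, one per segment.
    Returns the mean of 1 - cos(angle) between predicted and true
    segment-to-segment change vectors: 0 when trends align, 2 when opposite.
    """
    if len(pred_va) != len(true_va) or len(true_va) < 2:
        raise ValueError("need two equally long sequences with at least 2 segments")

    losses = []
    for t in range(1, len(true_va)):
        # Change in (valence, arousal) from the previous segment to this one.
        dp = (pred_va[t][0] - pred_va[t - 1][0], pred_va[t][1] - pred_va[t - 1][1])
        dt = (true_va[t][0] - true_va[t - 1][0], true_va[t][1] - true_va[t - 1][1])

        dot = dp[0] * dt[0] + dp[1] * dt[1]
        norm = math.hypot(dp[0], dp[1]) * math.hypot(dt[0], dt[1])

        # eps guards against division by zero when an emotion does not change.
        cos = dot / (norm + eps)
        losses.append(1.0 - cos)

    return sum(losses) / len(losses)


true_track = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
same_trend = [(0.0, 0.0), (0.5, 0.5), (1.5, 0.5)]          # same directions, smaller steps
opposite_trend = [(0.0, 0.0), (-1.0, -1.0), (-2.0, -1.0)]  # mirrored directions

print(context_loss(same_trend, true_track))
print(context_loss(opposite_trend, true_track))
```

Under this reading the loss depends only on the direction of emotional change, not its magnitude, so it would be used alongside the per-segment emotion, valence, and arousal objectives rather than in place of them.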

Abstract

Emotion recognition plays a crucial role in various domains of human-robot interaction. In long-term interactions with humans, robots need to respond continuously and accurately; however, mainstream emotion recognition methods mostly focus on short-term emotion recognition, disregarding the context in which emotions are perceived. Humans take contextual information into account, and different contexts can lead to completely different emotional expressions. In this paper, we introduce the self context-aware model (SCAM), which employs a two-dimensional emotion coordinate system for anchoring and re-labeling distinct emotions. It also incorporates a distinctive information retention structure and a contextual loss. This approach yields significant improvements in the audio, visual, and multimodal settings. In the auditory modality, accuracy rises from 63.10% to 72.46%. Similarly, in the visual modality, accuracy increases from 77.03% to 80.82%. In the multimodal setting, accuracy improves from 77.48% to 78.93%. In future work, we will validate the reliability and usability of SCAM on robots through psychology experiments.
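The reported accuracies can be compared more easily as absolute and relative gains. The short sketch below only does arithmetic on the three before/after pairs quoted in the abstract; the names `reported` and `gains` are illustrative.

```python
# Accuracy (%) before and after, as quoted in the abstract.
reported = {
    "audio": (63.10, 72.46),
    "visual": (77.03, 80.82),
    "multimodal": (77.48, 78.93),
}

def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 2)

for modality, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{modality}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 9.36, 3.79, and 1.45 percentage points for audio, visual, and multimodal respectively, so the improvement is largest for audio, where the starting accuracy was lowest.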

Paper Structure

This paper contains 19 sections, 7 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Context interaction in HRI
  • Figure 2: IEMOCAP emotions on Valence-Arousal axis
  • Figure 3: Log-Mel spectrogram of one segment
  • Figure 4: Cropped frames of one segment
  • Figure 5: Segment structure (multimodal)
  • ...and 11 more figures