Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization

Farida Mohsen, Ali Safa

TL;DR

The paper tackles real-time HRI intent detection using monocular RGB video on CPU-only hardware, addressing dataset imbalance and cross-domain generalization. It fuses camera-invariant 2D pose with facial emotion cues and evaluates frame- and sequence-level intent with lightweight temporal models, augmented by MINT-RVAE to synthesize coherent minority-class sequences. The approach achieves a strong offline AUROC of 0.95 and demonstrates real-world effectiveness with 91% accuracy and 100% recall during CPU-only deployment on the MIRA robot, even under cross-camera domain shifts. This work highlights a practical path toward affordable, robust, and proactive social robotics without depth sensing or GPUs.

Abstract

Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a practical multimodal framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video. Unlike prior methods requiring RGB-D sensors or GPU acceleration, our approach runs on resource-constrained embedded hardware (Raspberry Pi 5, CPU-only). To address the severe class imbalance in natural human-robot interaction datasets, we introduce MINT-RVAE (Multimodal Recurrent Variational Autoencoder for Intent Sequence Generation), a novel approach that synthesizes temporally coherent pose-emotion-label sequences for data re-balancing. Comprehensive offline evaluations under cross-subject and cross-scene protocols demonstrate strong generalization performance, achieving frame- and sequence-level AUROC of 0.95. Crucially, we validate real-world generalization through cross-camera evaluation on the MIRA robot head, which employs a different onboard RGB sensor and operates in uncontrolled environments not represented in the training data. Despite this domain shift, the deployed system achieves 91% accuracy and 100% recall across 32 live interaction trials. The close correspondence between offline and deployed performance confirms the cross-sensor and cross-environment robustness of the proposed multimodal approach, highlighting its suitability for ubiquitous multimedia-enabled social robots.

Paper Structure

This paper contains 30 sections, 18 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Proposed human interaction intent detection setup. The proposed system operates on two different platforms: (A) the MIRA robot, which integrates an expressive robotic head for HRI and an RGB camera for visual perception; and (B) a MyCobot 320 setup equipped with a different RGB camera. Two-dimensional human pose features and facial emotion features are extracted from the camera data using a YOLOv8-Pose model [ultralytics_yolov8] and a DeepFace model [serengil2020deepface], respectively. The pose and emotion cues are then fused into multimodal representations to predict interaction intent using various temporal models (GRU, LSTM, and a lightweight Transformer). (C) The proposed system runs in real time on a resource-constrained embedded computing platform (Raspberry Pi 5) without GPU acceleration (CPU-only).
  • Figure 2: Proposed data processing pipeline. The RGB camera feed from the robot arm is processed by a YOLOv8-Pose model to extract pose coordinates and by DeepFace to extract emotion vectors. These outputs are concatenated into multimodal feature vectors, which serve as input to the intent detection backbones studied in this work (GRU, LSTM, Transformer). To address the class imbalance in HRI data, our proposed MINT-RVAE generates synthetic sequences during training. The pipeline supports both frame-level and sequence-level intent prediction.
  • Figure 3: Experimental robotic platforms. (a) Data collection setup using an Elephant Robotics MyCobot 320 arm equipped with a monocular RGB camera for recording human–robot approach sequences. (b) MIRA social robot head used for the real-time deployment experiments, with an embedded Raspberry Pi 5 that performs CPU-only inference on the onboard RGB camera feed during evaluation.
  • Figure 4: Views from our data collection.
  • Figure 5: The proposed MINT-RVAE architecture. An input sequence $\mathcal{V}_r = \{\mathbf{z}_{1,r}, \dots, \mathbf{z}_{T_r,r}\}$ (each frame concatenating $\mathbf{g}^{\text{pose}}_{i,r}$, $\mathbf{g}^{\text{emo}}_{i,r}$, and $\ell_{i,r}$) is first processed by an MLP before being encoded by a GRU. The final encoder state parameterizes the latent vector $\mathbf{h} \in \mathbb{R}^{32}$ using the reparameterization trick. During decoding, $\mathbf{h}$ initializes the decoder GRU hidden state and is concatenated with the previous input $\tilde{\mathbf{z}}_{i,r}$ at each time step to predict the next multimodal frame $\hat{\mathbf{z}}_{i+1,r}$. The Probability Selector block (used only during training) controls the teacher-forcing rate, determining whether the decoder receives the ground-truth or predicted input. During inference, the latent vector $\mathbf{h}$ is sampled from the standard normal prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the decoder operates in a fully autoregressive mode, feeding back its own predictions $\hat{\mathbf{z}}_{i,r}$ as input.
  • ...and 6 more figures
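
The data flow described in the Figure 5 caption (frame fusion, latent sampling, teacher forcing, and autoregressive generation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The feature sizes `POSE_DIM` and `EMO_DIM`, all function names, and the `step_fn` callable that stands in for the decoder GRU are assumptions made for illustration; only the latent size of 32 comes from the caption. The recurrent hidden-state update and the MLP and GRU encoder are omitted.

```python
import math
import random

LATENT_DIM = 32  # from the Figure 5 caption: h in R^32
POSE_DIM = 34    # assumption: 17 keypoints x (x, y); the paper's features may differ
EMO_DIM = 7      # assumption: 7 emotion scores; the paper's features may differ


def fuse_frame(pose, emo, label):
    """Concatenate pose features, emotion features, and the intent label into one frame z."""
    return list(pose) + list(emo) + [float(label)]


def reparameterize(mu, logvar, rng):
    """Training-time latent sample: h = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def sample_prior(rng, dim=LATENT_DIM):
    """Inference-time latent sample: h ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def select_decoder_input(ground_truth, prediction, teacher_forcing_rate, rng, training):
    """Probability selector: during training, return the ground-truth frame with
    probability teacher_forcing_rate; otherwise return the model's own prediction."""
    if training and rng.random() < teacher_forcing_rate:
        return ground_truth
    return prediction


def decode(h, first_frame, steps, step_fn, rng, targets=None, teacher_forcing_rate=0.0):
    """Roll the decoder forward for `steps` frames.

    step_fn(h, z_prev) -> z_next is a hypothetical stand-in for the decoder GRU,
    which receives the latent vector together with the previous frame.
    If `targets` is None the loop is fully autoregressive (generation mode).
    """
    training = targets is not None
    z_prev, outputs = first_frame, []
    for i in range(steps):
        z_hat = step_fn(h, z_prev)          # predict the next multimodal frame
        outputs.append(z_hat)
        gt = targets[i] if training else None
        z_prev = select_decoder_input(gt, z_hat, teacher_forcing_rate, rng, training)
    return outputs
```

To generate a synthetic sequence under these assumptions, one would call `decode(sample_prior(rng), start_frame, steps, step_fn, rng)` with no `targets`, so that each predicted frame is fed back as the next input. Passing `targets` with a nonzero `teacher_forcing_rate` mimics the training-time behaviour of the Probability Selector.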