Generation of Real-time Robotic Emotional Expressions Learning from Human Demonstration in Mixed Reality

Chao Wang, Michael Gienger, Fan Zhang

TL;DR

This work tackles the challenge of producing natural, emotionally expressive robot behavior by learning from human demonstrations gathered in mixed reality (MR). It introduces an MR data-collection platform that maps facial cues and gestures to robot components, and couples it with a flow-matching, emotion-conditioned generator that synthesizes continuous robot poses in real time. The system is demonstrated on a real robot with adaptive visual feedback to mitigate motion sickness, and it is supported by an Emotional-Expression Dataset covering seven emotions, sampled at 10 Hz. Preliminary results confirm real-time capability and point to future work on temporal modeling, dataset expansion, and user studies to quantify recognizability and naturalness.

Abstract

Expressive behaviors in robots are critical for effectively conveying their emotional states during interactions with humans. In this work, we present a framework that autonomously generates realistic and diverse robotic emotional expressions based on expert human demonstrations captured in Mixed Reality (MR). Our system enables experts to teleoperate a virtual robot from a first-person perspective, capturing their facial expressions, head movements, and upper-body gestures, and mapping these behaviors onto corresponding robotic components including eyes, ears, neck, and arms. Leveraging a flow-matching-based generative process, our model learns to produce coherent and varied behaviors in real-time in response to moving objects, conditioned explicitly on given emotional states. A preliminary test validated the effectiveness of our approach for generating autonomous expressions.
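
The sampling side of a flow-matching generator can be sketched in a few lines. This is a minimal toy, not the authors' model. It assumes a straight-line probability path, and a closed-form velocity field stands in for the learned, emotion-conditioned network. The emotion labels, the target poses, and the names `TOY_TARGETS`, `toy_velocity`, and `sample_pose` are all hypothetical.

```python
import random

# Hypothetical emotion labels and 3-D "poses", for illustration only.
TOY_TARGETS = {
    "happy": [0.5, 0.2, -0.1],
    "sad": [-0.4, -0.3, 0.0],
}


def toy_velocity(x, t, emotion):
    """Stand-in for a learned velocity network v(x, t | emotion).

    For a straight-line path toward a single target x1, the velocity field
    is (x1 - x) / (1 - t). A trained model would approximate a field like
    this from demonstration data instead of using a lookup table.
    """
    target = TOY_TARGETS[emotion]
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


def sample_pose(emotion, num_steps=10, rng=None):
    """Integrate dx/dt = v(x, t | emotion) from t=0 to t=1 with Euler steps."""
    rng = rng or random.Random(0)
    # Start from Gaussian noise, as is usual for flow-matching samplers.
    x = [rng.gauss(0.0, 1.0) for _ in TOY_TARGETS[emotion]]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = toy_velocity(x, t, emotion)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample_pose("happy"))
```

Because each toy emotion has a single target, every sample collapses onto that target. A model trained on varied demonstrations would instead map different noise draws to different plausible poses, which is where the diversity of generated behavior comes from. A small number of integration steps keeps sampling cheap, which matters for real-time use.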

Paper Structure

This paper contains 13 sections, 7 equations, 4 figures.

Figures (4)

  • Figure 1: The XR platform. 1. Seven facial-expression values detected by the XR headset are mapped to the angle of the robot's ears and the shape of its eyes; the gaze direction is mapped to the position of the eyes on the plane of the robot's face screen. Some facial-expression values are also mapped to the movement of the robot's ears. 2. The operator's head position and orientation are mapped to the robot's end effector, expressed relative to the operator's head pose as the origin. The positional value is scaled by 1.5 to extend the operator's reach. 3. A virtual screen floats in front of the operator, allowing the operator to observe the environment from a first-person perspective.
  • Figure 2: Top: The MR headset subscribes to the actual rotation of the robot head and computes the rotation difference with respect to the human head. Bottom: The teleoperation view locally compensates the virtual screen orientation according to this difference, aligning the displayed video stream with the robot head pose and reducing visual mismatch to mitigate motion sickness.
  • Figure 3: Overview of flow matching for expression generation. A history window of robot and target poses, together with an emotion label (pink), is fed through a FiLM-conditioned U-Net to predict the action sequence (blue) that is executed on the robot.
  • Figure 4: Generated emotional expressions directed toward the moving target object.
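
The geometric mappings described in the captions of Figures 1 and 2 can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the authors' implementation. Only the 1.5 position scale comes from the caption. The function names, the `(w, x, y, z)` quaternion layout, and the convention that the rotation difference is `q_robot * q_human^-1` are hypothetical.

```python
import math

POSITION_SCALE = 1.5  # scale factor stated in the Figure 1 caption


def quat_conj(q):
    """Conjugate of a unit quaternion (w, x, y, z), i.e. its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def yaw_quat(angle_rad):
    """Unit quaternion for a rotation of angle_rad about the z axis."""
    return (math.cos(angle_rad / 2), 0.0, 0.0, math.sin(angle_rad / 2))


def head_to_effector_target(head_pos, origin_pos, scale=POSITION_SCALE):
    """Scale the head displacement from a reference origin (assumed given)."""
    return tuple(scale * (h - o) for h, o in zip(head_pos, origin_pos))


def rotation_difference(q_robot, q_human):
    """Rotation that takes the human head orientation to the robot's.

    Assumed convention: q_diff = q_robot * q_human^-1.
    """
    return quat_mul(q_robot, quat_conj(q_human))


if __name__ == "__main__":
    print(head_to_effector_target((0.2, 0.0, 1.7), (0.0, 0.0, 1.5)))
    print(rotation_difference(yaw_quat(0.5), yaw_quat(0.2)))
```

With these conventions, a head displacement of 0.2 m becomes an end-effector displacement of about 0.3 m, and a robot head that has turned 0.5 rad while the operator has turned 0.2 rad yields a 0.3 rad difference about the same axis. How that difference is applied to the virtual screen is not specified in the caption, so the sketch stops at computing it.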