Table of Contents
Fetching ...

PERCY: Personal Emotional Robotic Conversational System

Zhijin Meng, Mohammed Althubyani, Shengyuan Xie, Imran Razzak, Eduardo B. Sandoval, Mahdi Bamdad, Francisco Cruz

TL;DR

The paper addresses the gap of emotional awareness and long-term personalization in social robotics by integrating open-domain GPT-4 reasoning with real-time multimodal affect grounding. It introduces PERCY, a ROS-based system that fuses visual and textual cues to condition responses via a multimodal GPT-4 reasoning engine, enabling synchronized verbal and non-verbal behavior. Through automated and human evaluations, PERCY demonstrates strong empathy, personalization, and competitive naturalness, achieving an emotion-recognition accuracy of $92.0\%$ and an end-to-end latency of $1.7\,\text{s}$, while outperforming text-only GPT-4 and EmpGPT-3 on personalization and diversity. The work offers practical insights and a foundation for scalable, ethically grounded emotionally intelligent human–robot interaction in open-domain settings, with open-source intent to catalyze future research.

Abstract

Traditional rule-based conversational robots, constrained by predefined scripts and static response mappings, fundamentally lack adaptability for personalized, long-term human interaction. While Large Language Models (LLMs) like GPT-4 have revolutionized conversational AI through open-domain capabilities, current social robots implementing LLMs still lack emotional awareness and continuous personalization. This dual limitation hinders their ability to sustain engagement across multiple interaction sessions. We bridge this gap with PERCY (Personal Emotional Robotic Conversational sYstem), a system designed to enable open-domain, multi-turn dialogues by dynamically analyzing users' real-time facial expressions and vocabulary to tailor responses based on their emotional state. Built on a ROS-based multimodal framework, PERCY integrates a fine-tuned GPT-4 reasoning engine, combining textual sentiment analysis with visual emotional cues to accurately assess and respond to user emotions. We evaluated PERCY's performance through various dialogue quality metrics, showing strong coherence, relevance, and diversity. Human evaluations revealed PERCY's superior personalization and comparable naturalness to other models. This work highlights the potential for integrating advanced multimodal perception and personalization in social robot dialogue systems.

PERCY: Personal Emotional Robotic Conversational System

TL;DR

The paper addresses the gap of emotional awareness and long-term personalization in social robotics by integrating open-domain GPT-4 reasoning with real-time multimodal affect grounding. It introduces PERCY, a ROS-based system that fuses visual and textual cues to condition responses via a multimodal GPT-4 reasoning engine, enabling synchronized verbal and non-verbal behavior. Through automated and human evaluations, PERCY demonstrates strong empathy, personalization, and competitive naturalness, achieving an emotion-recognition accuracy of and an end-to-end latency of , while outperforming text-only GPT-4 and EmpGPT-3 on personalization and diversity. The work offers practical insights and a foundation for scalable, ethically grounded emotionally intelligent human–robot interaction in open-domain settings, with open-source intent to catalyze future research.

Abstract

Traditional rule-based conversational robots, constrained by predefined scripts and static response mappings, fundamentally lack adaptability for personalized, long-term human interaction. While Large Language Models (LLMs) like GPT-4 have revolutionized conversational AI through open-domain capabilities, current social robots implementing LLMs still lack emotional awareness and continuous personalization. This dual limitation hinders their ability to sustain engagement across multiple interaction sessions. We bridge this gap with PERCY (Personal Emotional Robotic Conversational sYstem), a system designed to enable open-domain, multi-turn dialogues by dynamically analyzing users' real-time facial expressions and vocabulary to tailor responses based on their emotional state. Built on a ROS-based multimodal framework, PERCY integrates a fine-tuned GPT-4 reasoning engine, combining textual sentiment analysis with visual emotional cues to accurately assess and respond to user emotions. We evaluated PERCY's performance through various dialogue quality metrics, showing strong coherence, relevance, and diversity. Human evaluations revealed PERCY's superior personalization and comparable naturalness to other models. This work highlights the potential for integrating advanced multimodal perception and personalization in social robot dialogue systems.

Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: PERCY's architecture processes speech (converted to text) and facial expressions to assess real-time emotional states. These multimodal signals are fused to condition GPT-4's personalized response generation, which drives ARI's emotion-aware verbal and non-verbal response.
  • Figure 2: Real-time Emotion Recognition and Non-Verbal Control System Architecture: Deployed on an ARI robot, PERCY processes RGB inputs through the robot’s depth camera, leveraging a MobileNetV2 backbone and SSD model to simultaneously classify emotions and predict bounding boxes. User verbal inputs are concurrently analyzed using NLTK’s VADER sentiment analysis module. Fused visual and textual emotional cues drive real-time facial expression adjustments via the Emotion Control Module, while guiding GPT-4’s generation of context-aware verbal responses through TTS, enabling synchronized vision-language multimodal interaction.
  • Figure 3: PERCY integrates visual and textual emotional states into the robot's non-verbal actions using a unified action package. When sadness is detected, it triggers predefined facial expressions.
  • Figure 4: GPT-4 Multimodal Reasoning Flow: PERCY fuses visual affect (Section \ref{['subsubsec:emotion_recognition_computer_vision']}) and textual sentiment (Section \ref{['subsubsec:emotion_recognition_sentiment_analysis']}) via Eq. \ref{['eq:GPT_Prompt']}, integrating user personal information, emotional state, and dialogue history data to generate responses. Combines LLM capabilities with ethical constraints and affective alignment for context-aware interaction.