EVOLVE: Emotion and Visual Output Learning via LLM Evaluation
Jordan Sinclair, Christopher Reardon
TL;DR
The paper addresses the challenge of producing believable empathy in social robots by coordinating verbal and nonverbal cues. It introduces an LLM-driven pipeline that maps each visual input to three coordinated outputs: an emoji-based affect, an LED color palette, and a motion pattern. The pipeline uses vision-language models and an atomic-action schema to support open-ended emotional responses. Key contributions include the integration of vision-language reasoning for image analysis, an expanded emoji-informed affect space, and modular action building blocks that convey empathy through color and motion. Initial results show general alignment with expected affects but reveal biases and a dependence on image content for color selection. These findings motivate future work on prompt refinement and a retrieval-augmented generation (RAG) memory to personalize interactions and mitigate bias.
Abstract
Human acceptance of social robots is greatly affected by empathy and perceived understanding. This necessitates accurate and flexible responses to varied input data from the user. While such systems can become increasingly complex as more states or response types are included, new research applying large language models (LLMs) to human-robot interaction has allowed for more streamlined perception and reaction pipelines. LLM-selected actions and emotional expressions can help reinforce the realism of displayed empathy and improve communication between the robot and the user. Beyond portraying empathy in spoken or written responses, this shows the potential of using LLMs in actuated, real-world scenarios. In this work, we extend research in LLM-driven nonverbal behavior for social robots by considering more open-ended emotional response selection that leverages new advances in vision-language models, along with emotionally aligned motion and color pattern selections that strengthen the conveyance of meaning and empathy.
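To make the three-part output concrete, the following is a minimal sketch of how a model reply could be parsed and validated into an emoji affect, an LED color palette, and a motion pattern built from atomic actions. It is an illustration under stated assumptions, not the implementation described in this work: the JSON field names, the `ATOMIC_ACTIONS` vocabulary, the hex color format, and the names `ExpressiveResponse` and `parse_response` are all hypothetical.

```python
import json
import re
from dataclasses import dataclass
from typing import Tuple

# Hypothetical atomic-action vocabulary (an assumption for illustration;
# the actual schema is not specified in this section).
ATOMIC_ACTIONS = {"nod", "tilt_left", "tilt_right", "lean_forward", "lean_back", "pause"}

# Assumed color format: "#RRGGBB" hex strings.
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


@dataclass(frozen=True)
class ExpressiveResponse:
    emoji: str                    # emoji-based affect label
    led_palette: Tuple[str, ...]  # LED colors as hex strings
    motion: Tuple[str, ...]       # ordered sequence of atomic actions


def parse_response(raw: str) -> ExpressiveResponse:
    """Parse a JSON reply (assumed format) and reject malformed fields."""
    data = json.loads(raw)
    emoji = data["emoji"]
    palette = tuple(data["led_palette"])
    motion = tuple(data["motion"])

    if not isinstance(emoji, str) or not emoji:
        raise ValueError("emoji must be a non-empty string")

    bad_colors = [c for c in palette
                  if not isinstance(c, str) or not HEX_COLOR.fullmatch(c)]
    if not palette or bad_colors:
        raise ValueError(f"invalid LED palette: {bad_colors or 'empty'}")

    # Only actions from the known vocabulary are accepted, so an
    # unexpected model output cannot reach the actuators.
    unknown = [a for a in motion
               if not isinstance(a, str) or a not in ATOMIC_ACTIONS]
    if not motion or unknown:
        raise ValueError(f"invalid motion pattern: {unknown or 'empty'}")

    return ExpressiveResponse(emoji, palette, motion)
```

Validating against a closed action vocabulary keeps the emotional response open-ended at the level of combinations while restricting the robot to motions it can actually execute.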
