EVOLVE: Emotion and Visual Output Learning via LLM Evaluation

Jordan Sinclair, Christopher Reardon

TL;DR

The paper addresses the challenge of producing believable empathy in social robots by coordinating verbal and nonverbal cues. It introduces an LLM-driven pipeline that maps each visual input to three outputs (an emoji-based affect, an LED color palette, and a motion pattern), using vision-language models and an atomic-action schema to support open-ended emotional responses. Key contributions include the integration of vision-language reasoning for image analysis, an expanded emoji-informed affect space, and modular action building blocks that convey empathy through color and motion. Initial results show general alignment with expected affects but reveal biases and a dependence of color selection on image content, motivating future work on prompt refinement and a Retrieval-Augmented Generation memory to personalize interactions and mitigate bias.

Abstract

Human acceptance of social robots is greatly affected by empathy and perceived understanding. This necessitates accurate and flexible responses to varied input data from the user. While such systems can become increasingly complex as more states or response types are included, new research applying large language models to human-robot interaction has allowed for more streamlined perception and reaction pipelines. LLM-selected actions and emotional expressions can help reinforce the realism of displayed empathy and improve communication between the robot and the user. Beyond portraying empathy in spoken or written responses, this shows the possibilities of using LLMs in actuated, real-world scenarios. In this work we extend research in LLM-driven nonverbal behavior for social robots by considering more open-ended emotional response selection that leverages new advances in vision-language models, along with emotionally aligned motion and color pattern selections that strengthen the conveyance of meaning and empathy.
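
As an illustration of the three-part output described above (an emoji, an LED color palette, and a motion pattern built from atomic actions), the following is a minimal sketch of how such a response might be represented and validated. It assumes the LLM replies with a JSON object; the field names, the `ATOMIC_ACTIONS` vocabulary, and the `parse_response` helper are hypothetical and are not taken from the paper.

```python
import json
import re
from dataclasses import dataclass
from typing import Tuple

# Hypothetical atomic-action vocabulary (an assumption for illustration;
# the paper's actual action schema is not reproduced here).
ATOMIC_ACTIONS = {"lean_forward", "lean_back", "tilt_left", "tilt_right", "nod", "pause"}

# LED colors are assumed to be given as "#RRGGBB" hex strings.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


@dataclass(frozen=True)
class ExpressiveResponse:
    emoji: str                 # affective response, e.g. a single emoji
    palette: Tuple[str, ...]   # LED colors as hex strings
    motion: Tuple[str, ...]    # ordered sequence of atomic actions


def parse_response(raw: str) -> ExpressiveResponse:
    """Parse and validate a JSON reply of the assumed shape."""
    data = json.loads(raw)
    emoji = data["emoji"]
    palette = tuple(data["palette"])
    motion = tuple(data["motion"])

    if not isinstance(emoji, str) or not emoji:
        raise ValueError("emoji must be a non-empty string")
    if not palette or not all(
        isinstance(c, str) and HEX_COLOR.fullmatch(c) for c in palette
    ):
        raise ValueError("palette must be a non-empty list of #RRGGBB colors")
    unknown = [a for a in motion if a not in ATOMIC_ACTIONS]
    if not motion or unknown:
        raise ValueError(f"motion must use known atomic actions, got {unknown}")

    return ExpressiveResponse(emoji, palette, motion)


if __name__ == "__main__":
    reply = '{"emoji": "😊", "palette": ["#FFD166", "#06D6A0"], "motion": ["nod", "lean_forward"]}'
    print(parse_response(reply))
```

Validating against a fixed action vocabulary is one way to keep open-ended LLM output executable on hardware: the emoji and colors remain free-form, while the motion stays within what the robot can actually perform.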

Paper Structure (3 sections, 5 figures)

Figures (5)

  • Figure 1: (a) The LLM evaluates a camera image input and determines three visual outputs that evolve with new data: an emoji representing affective response, a color palette to be visualized on LEDs, and a motion pattern. (b) Potential robot design with these characteristics.
  • Figure 2: Prompt procedure
  • Figure 3: LLM response for image initially labelled as contentment.
  • Figure 4: LLM response for image initially labelled as excitement.
  • Figure 5: LLM response for image initially labelled as fear.