Xpress: A System For Dynamic, Context-Aware Robot Facial Expressions using Language Models
Victor Nikhil Antony, Maia Stiber, Chien-Ming Huang
TL;DR
Xpress presents a three-phase, LM-driven pipeline that generates context-aware robotic facial expressions as a temporal trajectory $\tau_{face}$ conditioned on interaction content and socio-emotional context $\{z_T,z_C\}$. By coupling a flexible 28-DoF face with LM-based code generation, the approach enables dynamic, expressive, and contextually aligned facial behavior across storytelling and real-time conversation. Empirical evaluations show high executable-code rates (≈100%), substantial context alignment (≈71–75%), and strong expressiveness (≈3.6–3.9 on a 5-point scale), with notable variation in perceived intensity and some latency considerations in real-time deployment. The work demonstrates the practical potential of LM-assisted expression generation in social robotics while outlining key extensions to multi-modal cues, co-creation, and real-time performance.
Abstract
Facial expressions are vital in human communication and significantly influence outcomes in human-robot interaction (HRI), such as likeability, trust, and companionship. However, current methods for generating robotic facial expressions are often labor-intensive, lack adaptability across contexts and platforms, and have limited expressive ranges, leading to repetitive behaviors that reduce interaction quality, particularly in long-term scenarios. We introduce Xpress, a system that leverages language models (LMs) to dynamically generate context-aware facial expressions for robots through a three-phase process: encoding temporal flow, conditioning expressions on context, and generating facial expression code. We demonstrated Xpress as a proof of concept through two user studies (n=15 each) and a case study with children and parents (n=13) in storytelling and conversational scenarios to assess the system's context-awareness, expressiveness, and dynamism. Results show that Xpress can dynamically produce expressive and contextually appropriate facial expressions, highlighting its versatility and potential in HRI applications.
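
To make the three-phase process concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is not the authors' implementation. All names (`Segment`, `Keyframe`, `encode_temporal_flow`, `condition_on_context`, `generate_trajectory`, `toy_lm`, `pose_library`) and the fixed speaking rate are hypothetical. In particular, the third phase in Xpress has an LM generate facial expression code, whereas this sketch substitutes a simple pose lookup so the example stays self-contained and runnable without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

N_DOF = 28  # the face is described as 28-DoF; the meaning of each DoF is not assumed here


@dataclass
class Segment:
    start_s: float
    end_s: float
    text: str


@dataclass
class Keyframe:
    t: float            # time (seconds) at which the pose should be reached
    pose: List[float]   # one value per degree of freedom


def encode_temporal_flow(sentences: List[str], words_per_second: float = 2.5) -> List[Segment]:
    """Phase 1 (sketch): give each sentence a time window from an assumed speaking rate."""
    segments, t = [], 0.0
    for sentence in sentences:
        duration = len(sentence.split()) / words_per_second
        segments.append(Segment(t, t + duration, sentence))
        t += duration
    return segments


def condition_on_context(segments: List[Segment], z_C: str,
                         lm: Callable[[str, str], str]) -> List[str]:
    """Phase 2 (sketch): ask a language model for an expression label per segment.

    `lm` is any callable taking (segment text, socio-emotional context) and
    returning a label; a real system would call an actual LM here.
    """
    return [lm(seg.text, z_C) for seg in segments]


def generate_trajectory(segments: List[Segment], labels: List[str],
                        pose_library: Dict[str, List[float]]) -> List[Keyframe]:
    """Phase 3 (stand-in): map labels to poses; unknown labels fall back to neutral.

    This lookup replaces LM-based code generation purely for illustration.
    """
    neutral = [0.0] * N_DOF
    return [Keyframe(seg.start_s, pose_library.get(label, neutral))
            for seg, label in zip(segments, labels)]


def toy_lm(text: str, z_C: str) -> str:
    """Hypothetical keyword-based stand-in for a language model."""
    return "joy" if "happy" in text.lower() else "neutral"


story = ["The fox was happy today", "Then it slept"]
segments = encode_temporal_flow(story)
labels = condition_on_context(segments, z_C="bedtime story for a child", lm=toy_lm)
tau_face = generate_trajectory(segments, labels, {"joy": [0.5] * N_DOF})
```

Here `tau_face` plays the role of the temporal trajectory $\tau_{face}$: a list of timed poses, one per segment, that a face controller could interpolate between. Keeping the LM behind a plain callable is a design choice of this sketch; it lets the stub be swapped for a real model without touching the other phases.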
