An Android Robot Head as Embodied Conversational Agent
Marcel Heisler, Christian Becker-Asano
TL;DR
The study demonstrates an autonomous embodied conversational agent realized as an android robot head. It couples four ML-driven components (Whisper for ASR, VITS for TTS, ChatGPT for dialogue, and a FaceFormer-based lip-sync pipeline) with manual animation schedules. The system runs as a Python web application that communicates with the ML services over REST; speech is played through an external speaker and accompanied by pre-scripted head movements. Development is iterative: public demonstrations and internal feedback drive incremental improvements (e.g., adding speech input and multilingual capabilities) and inform targeted enhancements to gaze and lip-sync. While the prototype shows the feasibility of a ChatGPT-centered embodied agent, the authors acknowledge limitations such as privacy concerns, dependency on closed-source components, and the difficulty of on-device edge deployment, which guide future work toward more robust, privacy-preserving, and multilingual embodiments.
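To make the described data flow concrete, one conversational turn could be chained roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the service URLs, the JSON field names (`audio`, `text`, `message`, `reply`, `frames`), and the helpers `post_json` and `run_turn` are all hypothetical and chosen purely for illustration.

```python
import json
import urllib.request
from dataclasses import dataclass

# Hypothetical endpoints: the source only states that the web application
# talks to the ML services over REST, not where or how they are exposed.
ASR_URL = "http://localhost:8001/transcribe"   # assumed Whisper wrapper
CHAT_URL = "http://localhost:8002/chat"        # assumed ChatGPT proxy
TTS_URL = "http://localhost:8003/synthesize"   # assumed VITS wrapper
LIPSYNC_URL = "http://localhost:8004/lipsync"  # assumed FaceFormer-based service


def post_json(url, payload, timeout=30.0):
    """POST a JSON payload and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


@dataclass
class Turn:
    user_text: str    # transcript of what the user said
    reply_text: str   # text answer produced by the dialogue service
    audio_b64: str    # synthesized speech, base64-encoded (assumed format)
    lip_frames: list  # per-frame lip animation values (assumed format)


def run_turn(audio_b64, post=post_json):
    """Run one turn: speech -> text -> reply -> speech -> lip animation."""
    user_text = post(ASR_URL, {"audio": audio_b64})["text"]
    reply_text = post(CHAT_URL, {"message": user_text})["reply"]
    speech = post(TTS_URL, {"text": reply_text})["audio"]
    frames = post(LIPSYNC_URL, {"audio": speech})["frames"]
    return Turn(user_text, reply_text, speech, frames)
```

Passing the transport function as the `post` parameter keeps the sketch independent of any particular deployment: the four services can be replaced by local stubs during development, and a single service can be swapped without touching the turn logic.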
Abstract
This paper describes how current Machine Learning (ML) techniques, combined with simple rule-based animation routines, make an android robot head an embodied conversational agent with ChatGPT as its core component. The android robot head is described, technical details of how lip-sync animation is achieved are given, and general software design decisions are presented. A public presentation of the system revealed opportunities for improvement, which are reported and which guide our iterative implementation approach.
