An Android Robot Head as Embodied Conversational Agent

Marcel Heisler, Christian Becker-Asano

TL;DR

The study demonstrates an autonomous embodied conversational agent realized as an android robot head by coupling four ML-driven components (Whisper for ASR, VITS for TTS, ChatGPT for dialogue, and a FaceFormer-based lip-sync pipeline) with manually defined animation schedules. The system runs as a Python web application that communicates with the ML services over REST; speech is played through an external speaker and accompanied by pre-scripted head movements. Development is iterative: public demonstrations and internal feedback drive incremental improvements (e.g., adding speech input and multilingual capabilities) and point to targeted enhancements of gaze and lip-sync. While the prototype shows that a ChatGPT-centered embodied agent is feasible, the authors acknowledge limitations such as privacy, dependency on closed-source components, and the difficulty of on-device edge deployment, which guide future work toward more robust, privacy-preserving, and multilingual embodiments.

Abstract

This paper describes how current Machine Learning (ML) techniques, combined with simple rule-based animation routines, turn an android robot head into an embodied conversational agent with ChatGPT as its core component. It describes the android robot head, gives technical details on how lip-sync animation is achieved, and presents general software design decisions. A public presentation of the system revealed opportunities for improvement, which are reported here and which guide our iterative implementation approach.
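
The pipeline outlined above (speech recognition, dialogue, speech synthesis, lip-sync, with animation phases following the pipeline's progress) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the phase names, the `run_turn` function, the `Turn` record, and the per-frame mouth values are hypothetical, and each stage is passed in as a plain callable standing in for what, in the described system, would be a REST call to an ML service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Phase(Enum):
    """Hypothetical animation phases; the paper's own phase names may differ."""
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"
    IDLE = "idle"


@dataclass
class Turn:
    """Everything produced during one user/agent exchange."""
    user_text: str
    reply_text: str
    reply_audio: bytes
    mouth_frames: List[float]
    phases: List[Phase]


def run_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],        # stand-in for the ASR service
    chat: Callable[[str], str],                # stand-in for the dialogue service
    synthesize: Callable[[str], bytes],        # stand-in for the TTS service
    lip_sync: Callable[[bytes], List[float]],  # stand-in for the lip-sync service
    on_phase: Callable[[Phase], None] = lambda phase: None,
) -> Turn:
    """Run one turn and record which animation phase was active at each step."""
    phases: List[Phase] = []

    def enter(phase: Phase) -> None:
        # Record the phase and notify the (hypothetical) animation controller.
        phases.append(phase)
        on_phase(phase)

    enter(Phase.LISTENING)
    user_text = transcribe(audio_in)

    enter(Phase.THINKING)
    reply_text = chat(user_text)
    reply_audio = synthesize(reply_text)
    mouth_frames = lip_sync(reply_audio)

    enter(Phase.SPEAKING)
    # Audio playback and actuator commands would be issued here.

    enter(Phase.IDLE)
    return Turn(user_text, reply_text, reply_audio, mouth_frames, phases)


if __name__ == "__main__":
    # Trivial stubs replace the real services so the sketch runs offline.
    turn = run_turn(
        b"fake-audio",
        transcribe=lambda audio: "hello",
        chat=lambda text: text.upper(),
        synthesize=lambda text: text.encode("utf-8"),
        lip_sync=lambda audio: [0.0 for _ in audio],
    )
    print(turn.reply_text, [p.value for p in turn.phases])
```

Passing the stages in as callables keeps the orchestration independent of any particular service, so a stage could be swapped (for example, a different TTS model) without touching the turn logic.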

Paper Structure (13 sections, 3 figures)

Figures (3)

  • Figure 1: Actuators of the android robot head. Dotted lines indicate symmetric movements by a single actuator.
  • Figure 2: Overview of the current implementation. Top: basic pipeline from user input to response spoken by the android robot head. Bottom: animation phases according to the current timestep of the pipeline.
  • Figure 3: Current setup of embodied conversational agent application.