Does ChatGPT and Whisper Make Humanoid Robots More Relatable?
Xiaohui Chen, Katherine Luo, Trevor Gee, Mahla Nejati
TL;DR
This work investigates whether integrating Whisper ASR and ChatGPT with the Pepper humanoid robot improves relatability and interaction quality. The Pepper-GPT architecture combines a Whisper-based speech recognizer with GPT-3.5-turbo to produce either physical actions or natural language responses, coordinated by a PepperController. In the evaluation, Whisper gave the best speech recognition performance across diverse accents (average WER of 1.716% and processing time of 2.639 s), and a 15-participant user study reports generally positive experiences, with 60% rating the interaction as excellent. The study identifies remaining challenges, such as multilingual support and facial tracking, that need to be addressed for more natural and engaging human-robot interaction (HRI) systems.
Abstract
Humanoid robots are designed to be relatable to humans for applications such as customer support and helpdesk services. However, many such systems, including SoftBank's Pepper, fall short because they fail to communicate effectively with humans. The advent of Large Language Models (LLMs) shows the potential to overcome this communication barrier for humanoid robots. This paper presents a comparison of different Automatic Speech Recognition (ASR) APIs, the integration of Whisper ASR and ChatGPT with the Pepper robot, and an evaluation of the resulting system (Pepper-GPT) with 15 human users. The comparison shows that Whisper ASR performed best: its average Word Error Rate (1.716%) and processing time (2.639 s) were both lower than those of Google ASR and Google Cloud ASR. In the usability study, 60% of the participants rated the performance of Pepper-GPT as "excellent", while the rest rated it as "good" in the subsequent experiments. The results show that, while some problems remain, such as the robot's multilingual ability and facial tracking capacity, users generally responded positively to the system and felt as though they were talking to an actual human.
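
For readers unfamiliar with the metric, Word Error Rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this paper. The function name `word_error_rate`, the lowercase and whitespace normalization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace; the paper does not state its normalization rules here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("your" -> "you") and one deletion
# ("left") against a four-word reference.
example_wer = word_error_rate("turn your head left", "turn you head")
```

Under this definition, an average WER such as the 1.716% reported above would be obtained by averaging per-utterance values (or pooling errors over all reference words); which of the two the authors used is not stated in this section.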
