No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation

Qiaoqiao Ren; Yuanbo Hou; Dick Botteldooren; Tony Belpaeme

TL;DR

This work tackles the problem of fixed robot speech parameters undermining intelligibility in diverse environments. It combines a large-scale empirical study (GLMM analyses) to identify how voice parameters, room acoustics, ambient noise, and user characteristics influence intelligibility and user experience, with a data-driven adaptive pipeline that first predicts ambient annoyance using a CNN-based ARP model on the DeLTA dataset and then maps environment/user inputs to adaptive voice parameters via an Environment-to-Voice (ETV) network. The key contributions include (i) quantifying how factors like $T_{30}$, ambient annoyance, distance, and pitch affect both intelligibility and UX; (ii) a real-time ARP model capable of predicting annoyance in $\sim 1.87$ ms; and (iii) the ETV adaptive-speech system that significantly improves intelligibility and user experience relative to fixed speech across conditions. Collectively, the findings demonstrate the practicality of context- and user-aware robot speech for robust human-robot interaction in real-world acoustic environments.

Abstract

Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience. However, increasing the distance between the user and the robot worsened the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequently, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to fixed voice.
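The two-stage idea described above (first estimate how difficult the acoustic setting is, then choose voice parameters for it) can be illustrated with a minimal Python sketch. This is not the paper's model: the learned networks are replaced by a hand-written rule, and every name here (`Context`, `VoiceParams`, `adapt_voice`), the choice of `volume` and `rate` as parameters, the assumed direction of each adjustment, and all coefficients are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical description of the listening situation."""
    annoyance: float   # assumed 0..1 score from some ambient-annoyance predictor
    t30_s: float       # reverberation time T30 of the room, in seconds
    distance_m: float  # robot-to-user distance, in metres


@dataclass
class VoiceParams:
    """Hypothetical relative voice settings; 1.0 means the fixed default."""
    volume: float
    rate: float
    pitch: float


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def adapt_voice(ctx: Context) -> VoiceParams:
    """Hand-written stand-in for a learned environment-to-voice mapping.

    Assumption (not from the paper): speak louder with more annoyance and
    distance, and slower with more reverberation and annoyance. The
    coefficients are arbitrary placeholders.
    """
    volume = clamp(1.0 + 0.5 * ctx.annoyance + 0.1 * ctx.distance_m, 1.0, 2.0)
    rate = clamp(1.0 - 0.2 * ctx.t30_s - 0.2 * ctx.annoyance, 0.6, 1.0)
    return VoiceParams(volume=volume, rate=rate, pitch=1.0)


quiet_room = adapt_voice(Context(annoyance=0.0, t30_s=0.0, distance_m=0.0))
noisy_hall = adapt_voice(Context(annoyance=0.8, t30_s=1.0, distance_m=3.0))
```

In a learned system, `adapt_voice` would be replaced by a trained model and `annoyance` by the output of an annoyance predictor; the sketch only shows how the two stages connect.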

Paper Structure (29 sections, 2 equations, 7 figures, 2 tables)

Introduction
Intelligibility assessment
Experimental design
Materials
Procedure
Factors influencing intelligibility
Participant characteristics
Robot voice parameters
Environmental factors
Intelligibility metrics
User experience
Speech intelligibility
Generalised linear mixed-effect models
Intelligibility model results
Speech intelligibility model analysis
...and 14 more sections

Figures (7)

Figure 1: Experimental procedure.
Figure 2: Illustration of different environments in which data was collected.
Figure 3: Reverberation time (T30) of different rooms.
Figure 4: The CNN-based ARP model for overall evaluation of ambient sounds.
Figure 5: Robot speech adaptive system.
...and 2 more figures
