Giving Robots a Voice: Human-in-the-Loop Voice Creation and open-ended Labeling

Pol van Rijn, Silvan Mertes, Kathrin Janowski, Katharina Weitz, Nori Jacoby, Elisabeth André

TL;DR

This work addresses the problem of aligning a robot's voice with its appearance. It presents a large-scale, human-in-the-loop framework that combines a voice-creation tool, iterative voice tuning via Gibbs Sampling with People, and open-ended labeling through STEP-Tag. The authors demonstrate end-to-end feasibility by collecting perceptual ratings, building a diverse attribute taxonomy, and predicting well-matched voices for unseen robots; the predictions generalize across datasets and are offered through an online prediction tool. The findings reveal consistent cross-modal perceptual structures and show that visual cues can guide voice selection. The result is a practical pipeline for designers that also draws attention to ethical considerations and biases in robot perception. The study combines cognitive science methods with machine learning to address an engineering challenge in human-robot interaction.

Abstract

Speech is a natural interface for humans to interact with robots. Yet, aligning a robot's voice to its appearance is challenging due to the rich vocabulary of both modalities. Previous research has explored a few labels to describe robots and tested them on a limited number of robots and existing voices. Here, we develop a robot-voice creation tool followed by large-scale behavioral human experiments (N=2,505). First, participants collectively tune robotic voices to match 175 robot images using an adaptive human-in-the-loop pipeline. Then, participants describe their impression of the robot or their matched voice using another human-in-the-loop paradigm for open-ended labeling. The elicited taxonomy is then used to rate robot attributes and to predict the best voice for an unseen robot. We offer a web interface to aid engineers in customizing robot voices, demonstrating the synergy between cognitive science and machine learning for engineering tools.

Paper Structure (54 sections, 32 figures, 10 tables)

Figures (32)

  • Figure 1: Human-in-the-loop paradigms. (A) Gibbs Sampling with People. Participants change the slider, modifying only one dimension at a time. By cycling over the dimensions, participants explore dense regions in the feature space that are associated with a given robot. (B) STEP-Tag. Through the labeling process, participants simultaneously create new tags and review the tags provided by others. Over many iterations, meaningful and rich semantic labels are efficiently collected for each robot image.
  • Figure 2: Architecture. The voice of the robot is controlled via eight sliders. The first five sliders control the voice of the TTS model using the first five PCA dimensions of the speaker embeddings. The sixth slider controls the speed of the speech. The seventh slider selects one of the eight effects. The last slider determines the strength of the effect. Moving a slider updates one parameter in the voice configuration (here: speed). This triggers the synthesis pipeline, and the resulting audio is played back to the user.
  • Figure 3: GSP results. (A) Standardized difference between successive slider configurations. (B) PCA on all slider configurations from all iterations. The gray kernel density estimate indicates the distribution of all slider configurations in PCA space. The black points are the final slider configurations. (C) Mean ratings across iterations and for a random voice. Shaded areas are confidence intervals.
  • Figure 4: STEP-Tag results. (A) Raw occurrence of single labels for the 175 images and 175 voices. (B) Co-occurrence networks between provided tags per modality. Tags with a co-occurrence below 4 are pruned to remove words that are rarely used. The size of the nodes indicates the degree. Networks are created using Gephi.
  • Figure 5: Correlations between ratings along dimensions. Correlation across dimensions for (A) images and (B) matched voices. Correlation matrices are sorted by the order in the dendrogram obtained via agglomerative clustering. (C) Most consistently rated dimensions across both modalities. The diagonal difference is the difference in correlation between the diagonal and the mean correlation of the rest of the row. (D) Correlation across both modalities. The correlation matrix is sorted by mean correlation for the most consistently rated dimension "feminine". (E) Loading plots for both modalities. PCA components were obtained separately for the data of the correlation matrices in panels A (left) and B (right).
  • ...and 27 more figures
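
The Gibbs Sampling with People procedure described in the Figure 1 caption (one slider dimension changed per trial, cycling over dimensions) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the human participant is replaced by a hypothetical `simulated_participant` function that prefers configurations close to a hidden `target`, sliders are assumed to be normalized to [0, 1], and all names (`gsp_chain`, `N_DIMS`, `target`) are invented for illustration. Only the number of sliders (eight) is taken from the Figure 2 caption.

```python
import random

N_DIMS = 8  # eight sliders, as stated in the Figure 2 caption


def simulated_participant(state, dim, target, rng, n_candidates=11):
    """Hypothetical stand-in for a human rater.

    Tries several positions of slider `dim` while all other sliders stay
    fixed, and returns the position whose full configuration is closest
    to a hidden `target` (plus Gaussian noise to mimic noisy judgments).
    """
    candidates = [i / (n_candidates - 1) for i in range(n_candidates)]

    def noisy_distance(value):
        config = list(state)
        config[dim] = value
        dist = sum((c - t) ** 2 for c, t in zip(config, target))
        return dist + rng.gauss(0, 0.05)

    return min(candidates, key=noisy_distance)


def gsp_chain(target, n_iterations, seed=0):
    """Run one chain: each iteration updates exactly one dimension,
    cycling over the dimensions in order."""
    rng = random.Random(seed)
    state = [rng.random() for _ in range(N_DIMS)]  # random starting voice
    history = [list(state)]
    for it in range(n_iterations):
        dim = it % N_DIMS
        state[dim] = simulated_participant(state, dim, target, rng)
        history.append(list(state))
    return history


if __name__ == "__main__":
    # Assumed hidden "ideal voice" for one robot image (illustrative values).
    target = [0.2, 0.8, 0.5, 0.1, 0.9, 0.4, 0.6, 0.3]
    history = gsp_chain(target, n_iterations=3 * N_DIMS)
    print("start:", [round(v, 2) for v in history[0]])
    print("final:", [round(v, 2) for v in history[-1]])
```

In the actual experiments each update would come from a different human participant listening to synthesized audio; the simulated rater here only serves to show the one-dimension-at-a-time update structure.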