A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction

Yue Li, Florian A. Kunneman, Koen V. Hindriks

TL;DR

This work aims to enable natural, interruption-capable human-robot interaction (HRI) by filtering the robot's ego speech out of the audio from a single-channel microphone. It introduces a CNN-based estimator of the spectrogram of the robot's own speech and combines it with spectral subtraction to recover the overlapping human speech in near-real time. On a shared dataset, the pipeline outperforms state-of-the-art target speech extraction (TSE) baselines. The approach is implemented in the Social Interaction Cloud and validated in a small Pepper-based lab feasibility study, showing promising word error rates (WERs) and latency suitable for interactive use. The results indicate practical viability for continuous listening in HRI. The authors also note limitations related to frequency-domain oversubtraction and robustness to babble noise, and plan further enhancements and a broader HRI evaluation.

Abstract

With current state-of-the-art automatic speech recognition (ASR) systems, it is not possible to transcribe overlapping speech audio streams separately. Consequently, when these ASR systems are used as part of a social robot like Pepper for interaction with a human, it is common practice to close the robot's microphone while the robot itself is talking. This prevents human users from interrupting the robot, which limits speech-based human-robot interaction. To enable a more natural interaction that allows for such interruptions, we propose an audio processing pipeline for filtering out the robot's ego speech using only a single-channel microphone. This pipeline takes advantage of the possibility of feeding the robot's ego speech signal, generated by a text-to-speech API, as training data into a machine learning model. The proposed pipeline combines a convolutional neural network and spectral subtraction to extract overlapping human speech from the audio recorded by the robot-embedded microphone. When evaluated on a held-out test set, this pipeline outperforms our previous approach to this task, as well as state-of-the-art target speech extraction systems that were retrained on the same dataset. We have also integrated the proposed pipeline into a lightweight robot software development framework to make it available for broader use. As a step towards demonstrating the feasibility of deploying our pipeline, we use this framework to evaluate the effectiveness of the pipeline in a small lab-based feasibility pilot using the social robot Pepper. Our results show that when participants interrupt the robot, the pipeline can extract the participant's speech from one-second streaming audio buffers received by the robot-embedded single-channel microphone, and hence in near-real time.
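The spectral subtraction step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `spectral_subtract`, the `floor` parameter, and the toy spectrograms are assumptions made for illustration, and the CNN that would estimate the ego speech spectrogram is replaced here by a known array.

```python
import numpy as np


def spectral_subtract(mixture_mag, ego_mag_estimate, floor=0.01):
    """Subtract an estimated ego-speech magnitude spectrogram from a mixture.

    Both inputs are non-negative arrays of shape (freq_bins, frames).
    Bins where the estimate exceeds the mixture (oversubtraction) are
    clipped to a small fraction of the mixture, a common "spectral floor"
    heuristic. The floor is an assumption of this sketch, not a detail
    taken from the paper.
    """
    mixture_mag = np.asarray(mixture_mag, dtype=float)
    ego_mag_estimate = np.asarray(ego_mag_estimate, dtype=float)
    if mixture_mag.shape != ego_mag_estimate.shape:
        raise ValueError("spectrograms must have the same shape")
    residual = mixture_mag - ego_mag_estimate
    return np.maximum(residual, floor * mixture_mag)


# Toy spectrograms: 3 frequency bins, 2 frames (hypothetical values).
human = np.array([[0.0, 2.0], [1.0, 0.0], [0.5, 0.5]])
ego = np.array([[1.0, 1.0], [0.0, 3.0], [0.5, 0.0]])
mixture = human + ego  # idealised: magnitudes are assumed to add exactly

# In the pipeline the ego estimate would come from the CNN; here the true
# ego spectrogram stands in for a perfect estimate.
recovered = spectral_subtract(mixture, ego, floor=0.0)
```

With a perfect estimate and no floor, `recovered` equals `human`. In practice the estimate is imperfect and magnitudes of overlapping signals do not add exactly, so some bins are oversubtracted and are clipped rather than allowed to go negative.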

Paper Structure (23 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Illustration of audio typically used for target speech extraction versus overlapping speech recordings of the Pepper robot (in time domain).
  • Figure 2: The problem of robot ego speech filtering.
  • Figure 3: The architecture of the proposed CNN.
  • Figure 4: The proposed robot ego filtering pipeline.
  • Figure 5: Results of agglomerative clustering.
  • ...and 1 more figure