Lend me an Ear: Speech Enhancement Using a Robotic Arm with a Microphone Array

Zachary Turcotte; François Grondin

TL;DR

Experimental results demonstrate that this approach outperforms traditional recording configurations, achieving a higher scale-invariant signal-to-distortion ratio and a lower word error rate across multiple input signal-to-noise ratio conditions.

Abstract

Speech enhancement performance degrades significantly in noisy environments, limiting the deployment of speech-controlled technologies in industrial settings such as manufacturing plants. Existing speech enhancement solutions primarily rely on advanced digital signal processing techniques, deep learning methods, or complex software optimization techniques. This paper introduces a novel enhancement strategy that incorporates a physical optimization stage by dynamically modifying the geometry of a microphone array to adapt to changing acoustic conditions. A sixteen-microphone array is mounted on a robotic arm manipulator with seven degrees of freedom, with the microphones divided into four groups of four, including one group positioned near the end-effector. The system reconfigures the array by adjusting the manipulator joint angles to place the end-effector microphones closer to the target speaker, thereby improving the reference signal quality. The proposed method integrates sound source localization, computer vision, inverse kinematics, a minimum variance distortionless response (MVDR) beamformer, and time-frequency masking using a deep neural network. Experimental results demonstrate that this approach outperforms traditional recording configurations, achieving a higher scale-invariant signal-to-distortion ratio (SI-SDR) and a lower word error rate (WER) across multiple input signal-to-noise ratio (SNR) conditions.
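The abstract reports results in terms of the scale-invariant signal-to-distortion ratio. As a minimal sketch of how that metric is commonly computed (this is the standard textbook definition, not code from the paper; the function name `si_sdr`, the mean-removal step, and the `eps` regularizer are assumptions made here for illustration), the estimate is projected onto the clean reference and the energy of the projection is compared with the energy of the residual:

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; eps only guards against division
    by zero and log of zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the DC offset so a constant bias does not count as signal.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Everything in the estimate that is not the scaled reference.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero gain leaves the score essentially unchanged, which is why the metric suits comparisons between recording configurations that capture speech at different levels.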

Paper Structure (14 sections, 17 equations, 5 figures, 5 tables)

Introduction
Proposed Method
Hardware setup
Enhancement pipeline
Signal Model
Ideal Ratio Masks estimation
Sound Source Localization
Face Detection and Inverse Kinematics
Speech enhancement
Experiments and results
Sound Source localization
Speech Enhancement
Overall system
Conclusion

Figures (5)

Figure 1: All recording devices used during experiments
Figure 2: Experimental setup. Red markers indicate the site where the sub-arrays are located on the robotic arm. Green markers indicate the joint number.
Figure 3: Enhancement pipeline
Figure 4: Output SI-SDR vs input SI-SNR at channel 16
Figure 5: WER improvement vs the input SNR at channel 16
