Cross-Talk Reduction

Zhong-Qiu Wang; Anurag Kumar; Shinji Watanabe

TL;DR

The paper introduces the cross-talk reduction (CTR) task and a neural solution, CTRnet, which extracts each wearer's clean close-talk speech from jointly recorded close-talk and far-field mixtures. In the unsupervised framework, a DNN first estimates each speaker's close-talk speech $Z(c)$; forward convolutive prediction (FCP) then linearly filters these estimates to model and cancel the cross-talk captured at the other microphones, and a mixture-constraint loss trains the DNN so that the filtered estimates add up to the recorded mixtures. A weakly-supervised extension uses speaker-activity timestamps to mute non-speech frames during training and adds a speaker-activity loss, improving robustness on real data. Evaluations on the simulated SMS-WSJ-FF-CT data and on real CHiME-7 recordings show substantial improvements in SI-SDR/SDR and DA-WER, respectively. The resulting cross-talk-reduced close-talk signals could in turn be used to supervise and evaluate far-field separation systems, with implications for ASR and annotation workflows.
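For readers unfamiliar with SI-SDR, one of the metrics named in the summary above, the following is a minimal sketch of the standard scale-invariant SDR definition for a single time-domain signal pair. The function name `si_sdr`, the mean removal, and the `eps` stabilizer are illustrative choices made here; the paper's exact evaluation code is not shown in this summary.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals.

    Assumption: both signals are zero-meaned first, a common convention.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the best scaling.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10(
        (np.sum(target**2) + eps) / (np.sum(distortion**2) + eps)
    )
```

Because the reference is rescaled to best match the estimate, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SDR.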

Abstract

While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) of the wearer, it has a very limited range of applications, as it also contains significant cross-talk speech by other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR) which aims at reducing cross-talk speech, and a novel solution named CTRnet which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet.
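The idea that each speaker's estimate "can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones" can be illustrated with a small NumPy sketch in the spirit of FCP and the mixture-constraint loss. It rests on several assumptions that are not taken from the paper: only close-talk mixtures are used, the filters are causal with `taps` frames, they are fitted jointly by ordinary (unweighted) least squares per frequency, and the loss is a mean absolute error on complex STFT values. All function names are hypothetical, and the sketch only evaluates the loss; training a DNN would require a differentiable implementation.

```python
import numpy as np


def stack_taps(spec, taps):
    """Stack the current frame and the (taps - 1) previous frames.

    spec: (T, F) complex STFT. Returns an array of shape (T, F, taps).
    """
    n_frames, n_freqs = spec.shape
    pad = np.zeros((taps - 1, n_freqs), dtype=spec.dtype)
    padded = np.concatenate([pad, spec], axis=0)
    return np.stack(
        [padded[taps - 1 - k : taps - 1 - k + n_frames] for k in range(taps)],
        axis=-1,
    )


def predict_cross_talk(residual, others, taps=4):
    """Fit per-frequency linear filters from other speakers to `residual`.

    residual: (T, F) close-talk mixture minus the wearer's estimate.
    others: list of (T, F) estimates of the other speakers.
    Returns the predicted cross-talk, shape (T, F).
    """
    feats = np.concatenate([stack_taps(z, taps) for z in others], axis=-1)
    predicted = np.zeros_like(residual)
    for f in range(residual.shape[1]):
        A = feats[:, f, :]  # (T, taps * number of other speakers)
        g, *_ = np.linalg.lstsq(A, residual[:, f], rcond=None)
        predicted[:, f] = A @ g
    return predicted


def mixture_constraint_loss(close_talk_mixtures, estimates, taps=4):
    """Mean absolute error between each mixture and its reconstruction.

    Mixture c is modeled as estimate c plus filtered versions of the
    other speakers' estimates (an assumption of this sketch).
    """
    loss = 0.0
    for c, mixture in enumerate(close_talk_mixtures):
        others = [z for d, z in enumerate(estimates) if d != c]
        residual = mixture - estimates[c]
        cross_talk = predict_cross_talk(residual, others, taps)
        loss += np.mean(np.abs(residual - cross_talk))
    return loss / len(close_talk_mixtures)
```

With perfect estimates and cross-talk that really is a short linear filtering of the other speakers, the loss is numerically zero; with wrong estimates (for example, swapped speakers) the residual cannot be explained by filtering and the loss stays large. That gap is what lets such a loss act as a training signal without clean references.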

Paper Structure (21 sections, 17 equations, 3 figures, 2 tables)

Introduction
Related Work
Problem Formulation
Unsupervised CTRnet
DNN Configurations
Mixture-Constraint Loss
FCP for Filter Estimation
Weakly-Supervised CTRnet
Motivation
Muting during Training
Speaker-Activity Loss
Experimental Setup
SMS-WSJ-FF-CT and Evaluation Setup
CHiME-7 and Evaluation Setup
Miscellaneous Configurations of CTRnet
...and 6 more sections

Figures (3)

Figure 1: Task illustration. Best viewed in color.
Figure 2: Illustration of unsupervised CTRnet (see the first paragraph of the Unsupervised CTRnet section for a detailed description).
Figure 3: Illustration of sparse speaker overlap in human conversations. Best viewed in color. Each colored band means that the corresponding speaker is talking in the time range.
