Cross-Talk Reduction
Zhong-Qiu Wang, Anurag Kumar, Shinji Watanabe
TL;DR
The paper introduces the cross-talk reduction (CTR) task and a neural solution, CTRnet, which extracts each speaker’s clean close-talk speech from jointly recorded close-talk and far-field mixtures. In the unsupervised training framework, a DNN first estimates each speaker’s close-talk speech $Z(c)$; forward convolutive prediction (FCP) then linearly filters each estimate to model that speaker’s cross-talk at the other microphones, and a mixture-constraint loss trains the DNN so that the filtered estimates together reconstruct the recorded mixtures. A weakly-supervised extension uses speaker-activity timestamps to mute non-speech frames during training and adds a speaker-activity loss, improving robustness on real data. Evaluations show substantial gains in SI-SDR/SDR on the simulated SMS-WSJ-FF-CT data and in DA-WER on the real-recorded CHiME-7 data. Because training requires no clean reference signals, CTRnet is practical for distributed microphone setups, and its cross-talk-reduced close-talk signals could serve as supervision and evaluation references for far-field separation, with implications for ASR and annotation workflows.
Abstract
When far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures are recorded at the same time. Although the wearer's speech has a high signal-to-noise ratio (SNR) in each close-talk mixture, the mixture has a very limited range of applications, as it also contains significant cross-talk speech from other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR), which aims at reducing cross-talk speech, and a novel solution named CTRnet, which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet.
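To make the linear-filtering idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes STFT-domain signals of shape (frames, frequencies), an unweighted least-squares filter estimated independently per frequency, and an L1 loss on the residual; the function names (`stack_taps`, `fcp_filter`, `mixture_constraint_loss`), the number of taps, and the exact loss form are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def stack_taps(Z, taps):
    """Stack the current frame and (taps - 1) past frames of Z.

    Z has shape (T, F); the result has shape (T, F, taps), zero-padded at the start.
    """
    T, F = Z.shape
    out = np.zeros((T, F, taps), dtype=Z.dtype)
    for k in range(taps):
        out[k:, :, k] = Z[:T - k, :]
    return out


def fcp_filter(Z, Y, taps=4, eps=1e-8):
    """Filter the estimate Z so that it best matches the mixture Y (assumed form).

    For each frequency, solves an unweighted least-squares problem for a
    multi-tap filter g and returns the filtered estimate, shape (T, F).
    """
    Zs = stack_taps(Z, taps)                          # (T, F, K)
    R = np.einsum("tfk,tfl->fkl", Zs.conj(), Zs)      # per-frequency Gram matrix
    r = np.einsum("tfk,tf->fk", Zs.conj(), Y)         # per-frequency cross term
    R = R + eps * np.eye(taps)                        # small diagonal loading
    g = np.linalg.solve(R, r[..., None])[..., 0]      # (F, K) filter taps
    return np.einsum("tfk,fk->tf", Zs, g)


def mixture_constraint_loss(estimates, Y, taps=4):
    """Mean magnitude of what the filtered speaker estimates fail to explain in Y."""
    recon = sum(fcp_filter(Z, Y, taps) for Z in estimates)
    return float(np.mean(np.abs(Y - recon)))
```

In this sketch, `estimates` would hold one DNN output per speaker and `Y` the mixture at one microphone; summing the loss over all close-talk and far-field microphones gives a training signal that needs no clean reference. A training implementation would use a differentiable framework so that gradients flow through the filter estimation back to the DNN.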
