Target conversation extraction: Source separation using turn-taking dynamics

Tuochao Chen; Qirui Wang; Bohan Wu; Malek Itani; Sefik Emre Eskimez; Takuya Yoshioka; Shyamnath Gollakota

TL;DR

This paper introduces the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants, and proposes leveraging temporal patterns inherent in human conversations, particularly turn-taking dynamics.

Abstract

Extracting the speech of participants in a conversation amidst interfering speakers and noise presents a challenging problem. In this paper, we introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants. To accomplish this, we propose leveraging temporal patterns inherent in human conversations, particularly turn-taking dynamics, which uniquely characterize speakers engaged in conversation and distinguish them from interfering speakers and noise. Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets. In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for 2-4-speaker conversations. Code and dataset are available at https://github.com/chentuochao/Target-Conversation-Extraction.

Paper Structure (3 sections, 2 figures, 4 tables)

Experiments and Results
Conclusion
Acknowledgments

Figures (2)

Figure 3: A. Input vs output SI-SDR. B. Impact of the duration of the reference speaker in the conversation on SI-SDRi.
Figure 4: Visualization of time-domain waveforms. (a) is the input mixture of real conversations. (b), (c) are ground-truths for speakers in the conversation. (d) is the model output. (e), (f) are the audio segments that preserve overlaps and back-channels.
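
The Figure 3 caption refers to SI-SDR and SI-SDRi without defining them in this summary. The sketch below assumes the commonly used scale-invariant signal-to-distortion ratio definition (zero-mean signals, with the estimate projected onto the reference) and treats SI-SDRi as the SI-SDR of the model output minus the SI-SDR of the input mixture. The function names `si_sdr` and `si_sdr_improvement` and the `eps` parameter are illustrative and are not taken from the paper or its repository.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Assumes the common definition: remove the mean of both signals,
    project the estimate onto the reference, and compare the energy of
    the projected target with the energy of the residual.
    """
    n = len(reference)
    if n == 0 or len(estimate) != n:
        raise ValueError("estimate and reference must be non-empty and equal length")

    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    ref_energy = sum(r * r for r in ref)
    # Optimal scaling of the reference that best explains the estimate.
    scale = sum(e * r for e, r in zip(est, ref)) / (ref_energy + eps)

    target_energy = scale * scale * ref_energy
    residual_energy = sum((e - scale * r) ** 2 for e, r in zip(est, ref))
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the model output over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Synthetic stand-ins for audio: a target tone plus an interfering tone.
    reference = [math.sin(0.05 * t) for t in range(1000)]
    interference = [math.sin(0.31 * t + 1.0) for t in range(1000)]
    mixture = [r + i for r, i in zip(reference, interference)]
    # Hypothetical model output that suppresses 90% of the interference amplitude.
    estimate = [r + 0.1 * i for r, i in zip(reference, interference)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, reference):.2f} dB")
```

Because the estimate is rescaled before the comparison, multiplying the model output by a constant gain leaves the score unchanged, which is why this metric is often preferred over plain SNR when output loudness is not controlled.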
