Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription
Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
TL;DR
The paper presents a modular speech separation framework that combines TF-GridNet with a mixture encoder for meeting transcription on single-microphone data. By extending the mixture encoder to handle an arbitrary number of speakers and varying degrees of overlap, and by integrating it with a Conformer-based acoustic model, the approach achieves state-of-the-art results on LibriCSS and provides detailed analyses of the remaining gap to oracle performance. The findings show that TF-GridNet separates strongly enough to reduce the marginal gain from mixture encoding. A frame-wise error analysis indicates minimal cross-talker leakage and attributes the remaining errors to segmentation, reverberation, and artifacts. The work demonstrates the practical viability of high-quality continuous speech separation for meetings and offers insight into the trade-off between separation strength and auxiliary mixture information.
Abstract
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common approach first separates the speech into overlap-free streams and then performs ASR on each stream. Recently, TF-GridNet has shown impressive speech separation performance in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, regardless of whether mixture encoding is used. We further investigate the remaining potential for improvement.
