Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription

Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

TL;DR

The paper presents a modular speech separation framework that combines TF-GridNet with a mixture encoder for meeting transcription on single-microphone data. By extending the mixture encoder to handle arbitrary numbers of speakers and varying overlap, and by integrating with a Conformer-based acoustic model, the approach achieves state-of-the-art LibriCSS results and provides detailed analyses of remaining gaps to oracle performance. The findings show that TF-GridNet offers strong separation that reduces the marginal gain from mixture encoding, while frame-wise error analysis indicates minimal cross-talker leakage, attributing remaining errors to segmentation, reverberation, and artifacts. The work demonstrates the practical viability of high-quality continuous speech separation for meetings and provides insights into the trade-offs between separation strength and auxiliary mixture information.

Abstract

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, independent of mixture encoding. We further investigate the remaining potential for improvement.

Paper Structure

This paper contains 15 sections, 1 equation, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Model architectures with (a) baseline and (b) mixture encoder for acoustic modeling. Same colors denote shared parameters.
  • Figure 2: Visualization of the frame-wise error computation. The vertical gray lines represent frame boundaries. Frame-wise error measures are drawn for a comparison between primary channel forced alignment and primary channel word lattice as well as a comparison between primary channel word lattice and cross channel forced alignment. We indicate whether the word labels match (✔), mismatch (✖) or both channels' forced alignments contain silence (--). The upper comparison within the primary channel corresponds to the row with separated audio and lattice hypothesis in \ref{table:results_fers_ref} and in this example, we obtain $\text{HER} = 3/17 \approx 18\%$. The lower comparison between primary and cross channel corresponds to the row with lattice hypothesis for the primary channel in \ref{table:results_fers_cross} and the example results in $\text{CIR} = 5/22 \approx 23\%$. The frames are not drawn to scale.
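
The arithmetic behind the two example values in the Figure 2 caption can be checked with a small sketch. The caption gives only the totals (3 of 17 frames for HER, 5 of 22 frames for CIR). The helper `frame_rate`, the mark names, which mark is counted for each measure, and how many frames are skipped are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def frame_rate(marks: Sequence[str], counted: str, skipped: str = "skip") -> float:
    """Return the fraction of non-skipped frames whose mark equals `counted`.

    Hypothetical helper: the paper's exact rule for which frames enter the
    denominator is not given in this excerpt.
    """
    kept = [m for m in marks if m != skipped]
    if not kept:
        raise ValueError("no frames left after skipping")
    return sum(m == counted for m in kept) / len(kept)


# Made-up frame marks, chosen only to reproduce the totals in the caption.
her_marks = ["mismatch"] * 3 + ["match"] * 14                   # 3 of 17 frames
cir_marks = ["match"] * 5 + ["mismatch"] * 17 + ["skip"] * 4    # 5 of 22 counted frames

# Assumption: HER counts mismatching frames, CIR counts matching frames.
her = frame_rate(her_marks, counted="mismatch")
cir = frame_rate(cir_marks, counted="match")
print(f"HER = {her:.1%}, CIR = {cir:.1%}")
```

Rounded to whole percentages, these ratios give the 18% and 23% quoted in the caption.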