On Speaker Attribution with SURT

Desh Raj, Matthew Wiesner, Matthew Maciejewski, Leibny Paola Garcia-Perera, Daniel Povey, Sanjeev Khudanpur

TL;DR

This work extends the Streaming Unmixing and Recognition Transducer (SURT) to enable speaker-attributed streaming transcription by adding an auxiliary speaker transducer and a synchronization mechanism based on shared blank tokens. A novel speaker-prefixing strategy is introduced to maintain consistent relative speaker labels across utterance groups in long recordings. Extensive ablations on LibriSpeech mixtures and evaluations on the AMI corpus demonstrate that the auxiliary branch can produce reliable speaker labels with competitive cpWER, albeit with session-level reconciliation challenges. The approach preserves SURT's streaming properties and opens practical avenues for real-time meeting transcription and summarization. It also identifies directions for future work on session-level speaker identity reconciliation and enrollment-based enhancements.

Abstract

The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker automatic speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, it was demonstrated that SURT can be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework further by proposing methods to perform speaker-attributed transcription with SURT, for both short mixtures and long recordings. We achieve this by adding an auxiliary speaker branch to SURT, and synchronizing its label prediction with ASR token prediction through HAT-style blank factorization. To ensure consistency in relative speaker labels across different utterance groups in a recording, we propose "speaker prefixing" -- prepending to each chunk high-confidence frames of speakers identified in previous chunks, to establish the relative order. We perform extensive ablation experiments on synthetic LibriSpeech mixtures to validate our design choices, and demonstrate the efficacy of our final model on the AMI corpus.
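
The speaker-prefixing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the confidence threshold, the per-speaker frame cap, and the use of strings as stand-ins for feature frames are all assumptions made for illustration. The sketch places buffered frames before the chunk, which is one reading of "prefixing", and orders them by each speaker's first appearance so that relative labels can stay consistent.

```python
from collections import OrderedDict


def select_confident_frames(frames, confidences, threshold=0.9, max_frames=3):
    """Keep up to max_frames frames whose speaker confidence is at least threshold.

    The threshold and cap are hypothetical values, not taken from the paper.
    """
    picked = [f for f, c in zip(frames, confidences) if c >= threshold]
    return picked[:max_frames]


def build_prefixed_chunk(chunk_frames, speaker_buffers):
    """Place buffered frames of previously seen speakers before the chunk.

    speaker_buffers maps a relative speaker label to its buffered frames and
    is iterated in insertion order (order of first appearance).
    Returns the prefixed frame list and the prefix length, so that outputs
    emitted on prefix frames can be discarded afterwards.
    """
    prefix = []
    for frames in speaker_buffers.values():
        prefix.extend(frames)
    return prefix + list(chunk_frames), len(prefix)


# Toy example: strings stand in for acoustic feature frames.
buffers = OrderedDict()
buffers[0] = select_confident_frames(
    ["a1", "a2", "a3", "a4"], [0.95, 0.50, 0.99, 0.97], max_frames=2
)
buffers[1] = select_confident_frames(["b1", "b2"], [0.20, 0.93])

prefixed, prefix_len = build_prefixed_chunk(["c1", "c2", "c3"], buffers)
```

Returning the prefix length alongside the frames is a design choice of this sketch: it lets a caller ignore whatever the model emits on the prefix and keep only the labels for the new chunk.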

Paper Structure (23 sections, 8 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: An overview of SURT 2.0, as described in Raj2023Surt20. It consists of a masking network and a transducer-based ASR.
  • Figure 2: Auxiliary speaker transducer (red box) with shared blank label. The auxiliary encoder takes as input a hidden layer representation $\mathbf{h}_n$ from the main encoder, and generates $\mathbf{f}_{1:T}^{\mathrm{aux}}$. The blank logit $\mathbf{z}[0]$ from the main joiner is shared with the speaker branch to compute the HAT loss.
  • Figure 3: Projections of auxiliary encoder representations for a subset of LSMix dev. Each point denotes the representation of one speaker in a mixture, averaged over the frames on which the model emits a non-blank label. (a) and (b) show UMAP projections, and (c) shows an LDA projection using absolute speaker classes.
  • Figure 4: Utterance group statistics of the AMI meeting corpus: (a) number of speakers in the group, and (b) number of speakers seen before the group.
  • Figure 5: Effect of auxiliary encoder left context on (a) WDER and (b) cpWER. Dotted lines show best performance using $\infty$ left context.
  • ...and 2 more figures
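
The shared-blank mechanism in the Figure 2 caption can be sketched as follows. Under HAT-style factorization, the blank probability is a sigmoid of the blank logit, and the non-blank labels share the remaining probability mass through a softmax. In this sketch, the logit values are made up, and the assumption that the speaker joiner produces only non-blank logits while reusing the main joiner's $\mathbf{z}[0]$ is an interpretation of the caption, not a confirmed detail of the model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hat_distribution(blank_logit, label_logits):
    """Return [P(blank), P(label_1), ...] under HAT-style factorization."""
    p_blank = sigmoid(blank_logit)
    label_probs = softmax(label_logits)
    return [p_blank] + [(1.0 - p_blank) * p for p in label_probs]


# Hypothetical joiner outputs at one lattice position.
asr_logits = [0.5, 2.0, -1.0, 0.3]  # z[0] is blank, z[1:] are ASR tokens
speaker_logits = [1.2, -0.4]        # speaker branch: non-blank labels only

asr_dist = hat_distribution(asr_logits[0], asr_logits[1:])
# The speaker branch reuses the main joiner's blank logit z[0].
spk_dist = hat_distribution(asr_logits[0], speaker_logits)
```

Because both distributions are built from the same blank logit, the two branches assign identical probability to emitting blank at every position, which is what keeps speaker label prediction synchronized with ASR token prediction.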