META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR

Jinhan Wang, Weiqing Wang, Kunal Dhawan, Taejin Park, Myungjong Kim, Ivan Medennikov, He Huang, Nithin Koluguri, Jagadeesh Balam, Boris Ginsburg

TL;DR

The paper addresses end-to-end multi-talker ASR by unifying multi-speaker and target-speaker transcription tasks under a single architecture. It introduces Meta-Cat, a speaker-information concatenation mechanism, and leverages a Sortformer diarizer with a pre-trained encoder to inject speaker supervision without spectral masking or fixed speaker embeddings. Meta-Cat and its variants consistently improve MS-ASR and TS-ASR performance across AMI, ICSI, DipCo, and LibriSpeechMix, and the study explores a unified dual-task model that can perform both tasks. The results demonstrate a streamlined, robust approach to multi-speaker transcription with potential resilience to diarization errors, while also outlining future work on adapters or multi-head architectures to enhance dual-task learning.

Abstract

We propose a novel end-to-end multi-talker automatic speech recognition (ASR) framework that enables both multi-speaker (MS) ASR and target-speaker (TS) ASR. Our proposed model is trained in a fully end-to-end manner, incorporating speaker supervision from a pre-trained speaker diarization module. We introduce an intuitive yet effective method for masking ASR encoder activations using output from the speaker supervision module, a technique we term Meta-Cat (meta-information concatenation), that can be applied to both MS-ASR and TS-ASR. Our results demonstrate that the proposed architecture achieves competitive performance in both MS-ASR and TS-ASR tasks, without the need for traditional methods, such as neural mask estimation or masking at the audio or feature level. Furthermore, we demonstrate a glimpse of a unified dual-task model which can efficiently handle both MS-ASR and TS-ASR tasks. Thus, this work illustrates that a robust end-to-end multi-talker ASR framework can be implemented with a streamlined architecture, obviating the need for the complex speaker filtering mechanisms employed in previous studies.
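The masking-and-concatenation idea can be illustrated with a small sketch. The abstract says only that ASR encoder activations are masked using the speaker supervision output and that meta information is concatenated. The exact layout used here (one masked copy of each frame per speaker, joined along the feature axis) is an assumption made for illustration, and the names `meta_cat`, `encoder_out`, and `speaker_probs` are hypothetical rather than taken from the paper's code.

```python
from typing import List

Matrix = List[List[float]]  # rows are frames, columns are features


def meta_cat(encoder_out: Matrix, speaker_probs: Matrix) -> Matrix:
    """Illustrative sketch only; the layout is an assumption, not the paper's code.

    encoder_out:   (T, D) ASR encoder activations.
    speaker_probs: (T, K) per-frame speaker activity from a diarizer.
    Returns a (T, K * D) matrix: for each frame, one copy of the activation
    per speaker, scaled by that speaker's activity, concatenated feature-wise.
    """
    if len(encoder_out) != len(speaker_probs):
        raise ValueError("encoder and diarizer outputs must have the same number of frames")
    fused = []
    for frame, probs in zip(encoder_out, speaker_probs):
        row: List[float] = []
        for p in probs:
            # Mask (scale) the whole frame by this speaker's activity.
            row.extend(p * x for x in frame)
        fused.append(row)
    return fused


# Toy input: T=3 frames, D=2 features, K=2 speakers.
encoder_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
speaker_probs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # last frame is overlapped speech
fused = meta_cat(encoder_out, speaker_probs)
```

In this toy input, a frame where only one speaker is active keeps its activation in that speaker's slot and is zeroed in the other, while the overlapped frame appears in both slots. Under this reading, no spectral masking or separate speaker embedding is involved.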

Paper Structure

This paper contains 23 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Data flow of the proposed system that supports both MS-ASR and TS-ASR.
  • Figure 2: Exemplary illustration of MS-ASR and TS-ASR.
  • Figure 3: Meta-Cat converts the predicted timestamps to a masked embedding sequence that conveys speaker supervision information.
  • Figure 4: Meta-Cat-Residual (+Projection): Adding residual connections and another projection layer on top of Meta-Cat.