Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition

Yufeng Yang, Hassan Taherian, Vahid Ahmadi Kalkhorani, DeLiang Wang

TL;DR

The paper proposes decoupling the training of the speaker separation frontend from that of the ASR backend for robust multi-talker ASR, with the backend trained on clean speech only. The decoupled system reports a 5.1% WER on the Libri2Mix dev/test sets, 7.60%/5.74% WERs on 1-ch and 6-ch SMS-WSJ, and a speaker-attributed WER of 2.92% on recorded LibriCSS.

Abstract

Despite the tremendous success of automatic speech recognition (ASR) since the introduction of deep learning, its performance remains unsatisfactory in many real-world multi-talker scenarios. Speaker separation excels at separating individual talkers but, as a frontend, it introduces processing artifacts that degrade an ASR backend trained on clean speech. As a result, mainstream robust ASR systems train the backend on noisy speech to avoid processing artifacts. In this work, we propose to decouple the training of the speaker separation frontend and the ASR backend, with the latter trained on clean speech only. Our decoupled system achieves a word error rate (WER) of 5.1% on the Libri2Mix dev/test sets, significantly outperforming other multi-talker ASR baselines. Its effectiveness is also demonstrated by state-of-the-art WERs of 7.60%/5.74% on 1-ch and 6-ch SMS-WSJ. Furthermore, on recorded LibriCSS, we achieve a speaker-attributed WER of 2.92%. These state-of-the-art results suggest that decoupling speaker separation and recognition is an effective approach to elevating robust multi-talker ASR.

Paper Structure

This paper contains 3 sections.