Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

TL;DR

This paper tackles joint multi-talker and target-talker speech recognition within a single framework by using Whisper as a frozen foundation model. It attaches a Sidecar separator to Whisper's encoder to disentangle mixed speech into per-talker embeddings and introduces a Target Talker Identifier that needs only a three-second enrollment cue to locate the target talker's embedding flow on the fly, complemented by soft prompt tuning for the decoder. The approach outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks and shows acceptable zero-shot multi-talker ASR performance on the AishellMix Mandarin dataset. This work highlights the viability of adapting large speech foundation models to multi-talker and target-talker ASR with minimal extra components.

Abstract

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three-second enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
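
To make step (ii) concrete, here is a toy sketch of choosing one embedding flow from several using an enrollment cue. The abstract does not say how the Target Talker Identifier scores the flows, so the cosine-similarity matching below is an illustrative stand-in and not the authors' method. The function names (`mean_pool`, `cosine`, `pick_target_flow`) and the two-dimensional toy embeddings are hypothetical.

```python
import math


def mean_pool(frames):
    """Average a list of frame vectors into a single vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_target_flow(flows, enrollment_frames):
    """Return the index of the flow most similar to the enrollment cue.

    ASSUMPTION: similarity-based selection is a stand-in for the paper's
    Target Talker Identifier, whose internals the abstract does not describe.
    """
    cue = mean_pool(enrollment_frames)
    scores = [cosine(mean_pool(flow), cue) for flow in flows]
    best = max(range(len(flows)), key=scores.__getitem__)
    return best, scores


# Toy per-talker embedding flows, as a separator might emit them (hypothetical values).
flows = [
    [[1.0, 0.1], [0.9, 0.0]],  # flow 0
    [[0.0, 1.0], [0.1, 0.8]],  # flow 1
]
# Toy embedding of a short enrollment utterance from the target talker.
enrollment = [[0.05, 0.9], [0.0, 1.1]]

target_index, scores = pick_target_flow(flows, enrollment)
print(target_index)  # index of the flow that would be passed to the decoder
```

In this toy setup the enrollment cue points along the same direction as flow 1, so that flow is selected. In a system of the kind the abstract describes, only the selected flow would be decoded for target-talker ASR, while all flows would be decoded for multi-talker ASR.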

Paper Structure (2 sections, 1 table)

This paper contains 2 sections, 1 table.