ConVoiFilter: A case study of doing cocktail party speech recognition
Thai-Binh Nguyen, Alexander Waibel
TL;DR
This work tackles cocktail-party speech recognition by focusing on a target speaker and jointly optimizing a speaker-enhancement module with an ASR system. The ConVoiFilter pipeline uses a target-speaker embedding, cross-extraction, and a Conformer-based mask estimator to suppress interference before wav2vec2-based ASR, together with a chunk-merging strategy and a joint loss that blends enhancement (SI-SNR) and ASR (transducer) objectives. On noisy, overlapping speech, the WER falls from an 80% baseline to 26.4% when the two modules are tuned separately and cascaded, and to 14.5% when they are fine-tuned jointly, which demonstrates the value of tightly integrating speech enhancement and recognition. The paper also provides ablations and releases the pre-trained ConVoiFilter model to support further research and applications in speaker-specific robust ASR.
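To make the joint objective mentioned above more concrete, the following is a minimal NumPy sketch of the standard scale-invariant SNR (SI-SNR) measure and one possible way of blending it with an ASR loss. The function names `si_snr` and `joint_loss`, the weight `alpha`, and the simple linear blend are assumptions for illustration only; the weighting actually used for ConVoiFilter is not stated here.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for 1-D waveforms."""
    # Remove the DC offset so the measure ignores constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Whatever is left after the projection is treated as noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)


def joint_loss(si_snr_db, asr_loss, alpha=0.5):
    """Hypothetical blend: minimize negative SI-SNR together with an ASR loss.

    `alpha` is an assumed hyperparameter, not a value from the paper.
    """
    return alpha * (-si_snr_db) + (1.0 - alpha) * asr_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)          # stand-in for target speech
    interference = rng.standard_normal(16000)   # stand-in for other speakers
    enhanced = clean + 0.1 * interference       # stand-in for enhancer output
    print("SI-SNR (dB):", si_snr(enhanced, clean))
    print("joint loss:", joint_loss(si_snr(enhanced, clean), asr_loss=2.0))
```

Because SI-SNR is a quality score (higher is better), it enters the loss with a negative sign, while the ASR term is already a quantity to minimize. In a real training setup both terms would be computed with a differentiable framework so that gradients from the ASR loss can flow back into the enhancement module.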
Abstract
This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the speaker's voice from background noise, with an ASR module. With this approach, the model decreases the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce anomalies that degrade ASR performance. With a joint fine-tuning strategy, the model reduces the WER from 26.4% under separate tuning to 14.5% under joint tuning. We openly share our pre-trained model to foster further research at hf.co/nguyenvulebinh/voice-filter.
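For readers who prefer relative figures, the WER numbers quoted in the abstract can be converted into relative reductions with a few lines of arithmetic. The helper name `relative_reduction` is an assumption for illustration, and the results are only as precise as the rounded WER values given above.

```python
def relative_reduction(before, after):
    """Relative WER reduction in percent, given two WER values in percent."""
    return (before - after) / before * 100.0


# WER values as quoted in the abstract (percent).
baseline, cascade, joint = 80.0, 26.4, 14.5

print(round(relative_reduction(baseline, cascade), 1))  # baseline -> separate tuning
print(round(relative_reduction(cascade, joint), 1))     # separate -> joint tuning
print(round(relative_reduction(baseline, joint), 1))    # baseline -> joint tuning
```

Under these figures, separate tuning removes roughly two thirds of the baseline errors, and joint fine-tuning removes a bit under half of the errors that remain.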
