WST: Weakly Supervised Transducer for Automatic Speech Recognition

Dongji Gao; Chenda Liao; Changliang Liu; Matthew Wiesner; Leibny Paola Garcia; Daniel Povey; Sanjeev Khudanpur; Jian Wu

TL;DR

The paper addresses a data-quality problem in end-to-end ASR: RNN-T training depends on large quantities of accurately labeled transcripts. It introduces the Weakly Supervised Transducer (WST), which augments the standard transducer with a flexible, differentiable WFST training graph. The graph explicitly models transcript errors using token and blank bypass arcs and a star token, and it is paired with a stateless decoder approximation to manage branching histories. Empirically, WST outperforms BTC, OTC, and the standard Transducer on LibriSpeech and on an industrial IH-10k dataset. It remains robust at transcription error rates of up to $70\%$ while preserving performance on clean data; for example, under high insertion noise WST achieves a WER of $13.0\%$ on test-clean (OTC: $21.5\%$) and $26.9\%$ on test-other (OTC: $39.1\%$), with IW-10k gains of up to $9.71\%$ relative in total WER. These results demonstrate WST’s practical utility for real-world ASR with imperfect supervision and its potential to enable scalable, robust speech recognition without additional confidence estimation or pre-trained models.
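The summary mixes absolute WER figures (WST vs. OTC on LibriSpeech) with a relative gain (the industrial set). As a minimal sketch of how the two relate, the snippet below converts the quoted LibriSpeech WERs into relative reductions. The function name `relative_wer_reduction` is a hypothetical helper introduced here for illustration; it is not taken from the paper or from any library, and the derived percentages are not reported in the source.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction, in percent, of a system over a baseline.

    Both inputs are absolute WERs in percent, e.g. 21.5 for 21.5%.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer * 100.0


# Absolute WERs quoted in the TL;DR for the high-insertion-noise condition.
quoted = {
    "test-clean": {"OTC": 21.5, "WST": 13.0},
    "test-other": {"OTC": 39.1, "WST": 26.9},
}

for subset, wers in quoted.items():
    gain = relative_wer_reduction(wers["OTC"], wers["WST"])
    print(f"{subset}: {gain:.1f}% relative WER reduction over OTC")
```

Under these quoted figures, the relative reductions over OTC work out to roughly 39.5% on test-clean and 31.2% on test-other. The $9.71\%$ figure for the industrial set is already stated as relative, and its baseline is not named in this summary, so it is not recomputed here.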

Abstract

The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
