WST: Weakly Supervised Transducer for Automatic Speech Recognition

Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur, Jian Wu

TL;DR

The paper addresses the reliance of RNN-T on large quantities of accurately labeled transcripts in end-to-end ASR. It introduces the Weakly Supervised Transducer (WST), which augments the standard transducer with a flexible, differentiable WFST training graph that explicitly models transcript errors using token bypass arcs, blank bypass arcs, and a star token, coupled with a stateless decoder approximation to manage branching histories. Empirically, WST outperforms BTC, OTC, and the standard transducer on LibriSpeech and an industrial IH-10k dataset. It remains robust at transcription error rates of up to 70% while maintaining performance on clean transcripts. For example, at high insertion noise, WST achieves a WER of 13.0% on test-clean (OTC: 21.5%) and 26.9% on test-other (OTC: 39.1%), and on IH-10k it improves total WER by up to 9.71% relative. These results demonstrate WST's practical utility for real-world ASR with imperfect supervision, without additional confidence estimation or pre-trained models.

Abstract

The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Architecture of RNN-T. It contains an encoder, a decoder, and a joiner followed by a classification layer.
  • Figure 2: Transducer training graph in k2 for the transcript "a b c" aligned to an input sequence of 4 frames. The graph starts at state 0, and the double-circled state 17 represents the final state. Each arc is labeled with an input symbol and an output symbol (separated by a colon), followed by a weight after the slash indicating the log-probability of emitting the output symbol. State 7 is highlighted as an example. The vertical arc (a token arc) emits the output symbol "c" without advancing the time step. The horizontal arcs (referred to as blank arcs) consume a time frame without emitting a label ($\epsilon$).
  • Figure 3: WFST representation of transcript.
  • Figure 4: Weakly Supervised Transducer training graph in k2 for the transcript "a b c" aligned to an input sequence of 4 frames. Compared with a standard transducer graph, two types of bypass arcs are added: token bypass arcs and blank bypass arcs. The token bypass arcs (drawn vertically) enable the model to skip the current token while remaining in the same time frame, whereas the blank bypass arcs (drawn horizontally) allow the insertion of a $\star$ while advancing one time step. Each bypass arc carries a penalty. For example, at state 7, the token bypass arc permits skipping the token "c" with penalty $\lambda_{1}$, and the blank bypass arc permits inserting the $\star$ token with penalty $\lambda_{2}$.
  • Figure 5: Examples of dealing with different kinds of errors. The thickness of the arc indicates the probability assigned to it.
  • ...and 2 more figures
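
The lattice structure described in the captions of Figures 2 and 4 can be illustrated with a small pure-Python sketch. This is not the paper's k2 implementation: the `(t, u)` state naming, the `Arc` record, the `build_lattice` helper, the `<star>` string, and the example penalty values are all assumptions made for illustration, and the state numbering does not attempt to match the figures. The captions do not say which symbol a token bypass arc carries, so it is left as `None` here. Regular arcs would receive model log-probabilities during training; only the extra bypass penalty is recorded.

```python
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical state naming: (frame index t, number of transcript tokens consumed u).
State = Tuple[int, int]


class Arc(NamedTuple):
    src: State
    dst: State
    kind: str             # "token", "blank", "token_bypass" or "blank_bypass"
    label: Optional[str]  # emitted symbol, or None when nothing is emitted
    penalty: float        # extra weight on bypass arcs; 0.0 on regular arcs


def build_lattice(tokens: List[str], num_frames: int,
                  lambda1: Optional[float] = None,
                  lambda2: Optional[float] = None) -> List[Arc]:
    """Enumerate lattice arcs; bypass arcs are added only if a penalty is given."""
    arcs: List[Arc] = []
    num_tokens = len(tokens)
    for t in range(num_frames):
        for u in range(num_tokens + 1):
            if u < num_tokens:
                # Token arc: emit tokens[u] and stay in the same frame.
                arcs.append(Arc((t, u), (t, u + 1), "token", tokens[u], 0.0))
                if lambda1 is not None:
                    # Token bypass arc: skip tokens[u] in the same frame.
                    arcs.append(Arc((t, u), (t, u + 1), "token_bypass", None, lambda1))
            # In the last frame, only the fully consumed transcript may advance,
            # so that every path ends in the single end state (num_frames, num_tokens).
            if t < num_frames - 1 or u == num_tokens:
                # Blank arc: consume one frame without emitting a label.
                arcs.append(Arc((t, u), (t + 1, u), "blank", None, 0.0))
                if lambda2 is not None:
                    # Blank bypass arc: consume one frame and insert the star token.
                    arcs.append(Arc((t, u), (t + 1, u), "blank_bypass", "<star>", lambda2))
    return arcs


# Transcript "a b c" over 4 frames; the penalty values are arbitrary placeholders.
standard = build_lattice(["a", "b", "c"], num_frames=4)
wst = build_lattice(["a", "b", "c"], num_frames=4, lambda1=-1.0, lambda2=-1.0)
print(len(standard), len(wst))  # 25 50
```

In this sketch every regular arc gains exactly one parallel bypass arc, so the weakly supervised lattice has twice as many arcs as the standard one while sharing the same states.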