RNN-Transducer-based Losses for Speech Recognition on Noisy Targets

Vladimir Bataev

TL;DR

This work addresses training speech recognition systems on transcripts that contain errors, a common situation in industrial-scale data pipelines. It introduces three loss-function modifications for RNN-Transducer (RNN-T) models: Star-Transducer for deletions, Bypass-Transducer for insertions, and Target-Robust Transducer (TRT), which combines both. All three make training robust to noisy targets without changing the model architecture. The losses are implemented in a graph-based WFST framework for RNN-T and evaluated on LibriSpeech with synthetic transcription errors. Star-Transducer recovers most of the quality lost to deletions, Bypass-Transducer recovers a large portion of the quality lost to insertions, and TRT handles substitutions and arbitrary errors with substantial improvements over the standard RNN-T baseline. The results suggest practical benefits for training ASR systems on large, imperfectly labeled datasets, including low-resource languages and noisy data streams. The work also provides open-source implementations and visualizations to support adoption and further research on robust RNN-T losses.

Abstract

Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice, restoring over 90% of the system's performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.
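To make the "skip frame" and "skip token" idea concrete, the following is a minimal NumPy sketch of the RNN-T forward recursion over the loss lattice with two optional extra arcs. It is not the author's WFST implementation. It assumes that a skip-frame arc runs parallel to the blank arc and a skip-token arc runs parallel to the label arc, each with a constant log-weight penalty; the function name `transducer_nll` and the parameters `skip_frame_penalty` and `skip_token_penalty` are hypothetical names chosen for illustration.

```python
import numpy as np


def transducer_nll(log_probs, targets, blank=0,
                   skip_frame_penalty=None, skip_token_penalty=None):
    """Negative log-likelihood over an RNN-T lattice (illustrative sketch).

    log_probs: array of shape (T, U + 1, V) with joint log-probabilities.
    targets:   sequence of U label ids.
    skip_frame_penalty / skip_token_penalty: assumed constant log-weights of
        extra arcs parallel to blank / label arcs; None disables the arc.
    """
    targets = np.asarray(targets, dtype=np.int64)
    T, U1, _ = log_probs.shape
    U = len(targets)
    assert U1 == U + 1, "second axis must have length len(targets) + 1"

    # Weight of advancing one frame from node (t, u): blank arc,
    # optionally summed with a parallel skip-frame arc.
    frame_w = log_probs[:, :, blank].copy()
    if skip_frame_penalty is not None:
        frame_w = np.logaddexp(frame_w, skip_frame_penalty)

    # Weight of consuming target u from node (t, u): label arc,
    # optionally summed with a parallel skip-token arc.
    token_w = log_probs[:, np.arange(U), targets]  # shape (T, U)
    if skip_token_penalty is not None:
        token_w = np.logaddexp(token_w, skip_token_penalty)

    # Standard forward recursion in the log domain.
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            from_frame = alpha[t - 1, u] + frame_w[t - 1, u] if t > 0 else -np.inf
            from_token = alpha[t, u - 1] + token_w[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(from_frame, from_token)

    # A final frame-advancing arc leaves the lattice from node (T - 1, U).
    return -(alpha[T - 1, U] + frame_w[T - 1, U])


# Toy input: random logits normalized with a log-softmax over the vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4, 5))  # T=6, U=3, V=5
toy_log_probs = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
toy_targets = [1, 2, 3]

standard = transducer_nll(toy_log_probs, toy_targets)
star_like = transducer_nll(toy_log_probs, toy_targets, skip_frame_penalty=-1.0)
bypass_like = transducer_nll(toy_log_probs, toy_targets, skip_token_penalty=-1.0)
both = transducer_nll(toy_log_probs, toy_targets,
                      skip_frame_penalty=-1.0, skip_token_penalty=-1.0)
print(standard, star_like, bypass_like, both)
```

In this sketch the extra arcs only add probability mass to the lattice, so each variant's loss is never higher than the standard loss on the same input. With the penalties set to `None`, the function reduces to the ordinary RNN-T loss. The penalty value `-1.0` is arbitrary and chosen only for the demonstration.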

Paper Structure

This paper contains 50 sections, 3 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: RNN-Transducer Schema
  • Figure 2: WFSTs for RNN-Transducer, following laptev2023rnntwfst
  • Figure 3: Project Plan (Gantt Chart)
  • Figure 4: WFSTs for Star-Transducer. $\langle sf \rangle$ is a special symbol indicating skipping the frame.
  • Figure 5: WFSTs for Bypass-Transducer. $\langle st \rangle$ is a special symbol indicating skipping the token.
  • ...and 2 more figures