Alternating Weak Triphone/BPE Alignment Supervision from Hybrid Model Improves End-to-End ASR

Jintao Jiang, Yingbo Gao, Mohammad Zeineldeen, Zoltan Tuske

TL;DR

This work addresses alignment in end-to-end ASR by injecting weak supervision signals derived from a pre-trained hybrid system. It introduces two auxiliary cross-entropy losses, triphone-based Tri-CE and BPE-based BPE-CE, placed at a mid-layer and at the encoder/ctx level respectively, both with strong label smoothing, and it also explores replacing BPE-CTC with BPE-CE. On TED-LIUM 2, combining the two alignment-based losses reduces WER by more than 10% relative to a strengthened CTC-regularized baseline, and alternating the auxiliary losses during training yields a further gain. The study shows that hybrid-model alignments can improve end-to-end models without external language models. It acknowledges that the gains are demonstrated only with BLSTM encoders and suggests validating the technique on transformer-based architectures in future work.

Abstract

In this paper, alternating weak triphone/BPE alignment supervision is proposed to improve end-to-end model training. To this end, triphone and BPE alignments are extracted using a pre-existing hybrid ASR system. A regularization effect is then obtained through cross-entropy-based intermediate auxiliary losses computed on these alignments: at a mid-layer representation of the encoder for the triphone alignments and at the encoder for the BPE alignments. Weak supervision is achieved through strong label smoothing with a parameter of 0.5. Experimental results on TED-LIUM 2 indicate that weak supervision based on either the triphone or the BPE alignment improves ASR performance over a standard CTC auxiliary loss, and that combining the two lowers the word error rate further. We also investigate alternating the two auxiliary tasks during model training and observe an additional performance gain. Overall, the proposed techniques yield a relative error rate reduction of more than 10% over a CTC-regularized baseline system.
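
The weak supervision described in the abstract can be illustrated with a small sketch of a frame-level cross-entropy loss with label smoothing. This is not the authors' implementation: the function names, the toy logits, and the toy alignment are hypothetical, and the smoothing convention (spreading the smoothing mass uniformly over all classes, including the aligned one) is an assumption, since the abstract only states the parameter value of 0.5.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_frame_ce(frame_logits, alignment, smoothing=0.5):
    """Mean label-smoothed cross-entropy over frames.

    frame_logits: list of T lists, each holding K unnormalized scores.
    alignment:    list of T integer class ids, one label per frame.
    smoothing:    mass spread uniformly over all K classes
                  (assumed convention, not confirmed by the source).
    """
    if not alignment or len(frame_logits) != len(alignment):
        raise ValueError("need exactly one alignment label per frame")
    total = 0.0
    for logits, label in zip(frame_logits, alignment):
        k = len(logits)
        log_probs = log_softmax(logits)
        # Cross-entropy against the one-hot aligned label.
        hard_term = -log_probs[label]
        # Cross-entropy against the uniform distribution over K classes.
        uniform_term = -sum(log_probs) / k
        total += (1.0 - smoothing) * hard_term + smoothing * uniform_term
    return total / len(alignment)

# Toy example: 3 frames, 4 classes, made-up logits and alignment.
logits = [[2.0, 0.0, 0.0, 0.0],
          [0.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 2.0]]
alignment = [0, 1, 3]
hard_loss = smoothed_frame_ce(logits, alignment, smoothing=0.0)
weak_loss = smoothed_frame_ce(logits, alignment, smoothing=0.5)
```

With a smoothing value of 0.5, half of the loss weight goes to the aligned label and half to the uniform term, so the model is only softly pulled toward the hybrid alignment. Under the assumptions above, the same function could serve either the triphone or the BPE alignment by changing the number of classes.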

Paper Structure

This paper contains 8 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Diagram of the network.