Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding

Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami

TL;DR

This paper proposes a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition and introduces dual blank thresholding, which combines both HAT- and IAM-blank thresholding and a compatible decoding algorithm.

Abstract

A hybrid autoregressive transducer (HAT) is a variant of neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. The IAM consists of encoder and joint networks, which are fully shared and jointly trained with the HAT. This joint training not only enhances HAT training efficiency but also encourages the IAM and HAT to emit blanks synchronously. Because a confident blank decision lets the decoder skip the more expensive non-blank computation, this synchrony makes blank thresholding more effective and decoding faster. Experiments demonstrate that the relative error reductions of the HAT with IAM compared to the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines both HAT- and IAM-blank thresholding, together with a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
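
The blank-thresholding idea in the abstract can be illustrated with a minimal sketch. It assumes the usual HAT factorization (a sigmoid blank probability, with the label softmax scaled by one minus that probability) and a two-stage check in which the IAM blank is tested first and the HAT blank second. The function names, the threshold names `lam_iam` and `lam_hat`, their default values, and the ordering of the two checks are illustrative assumptions, not the paper's decoding algorithm.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hat_posterior(blank_logit, label_logits):
    """HAT-style factorization: blank and non-blank are modeled separately.

    Returns (p_blank, label_probs), where p_blank + sum(label_probs) == 1.
    """
    p_blank = sigmoid(blank_logit)
    label_probs = [(1.0 - p_blank) * p for p in softmax(label_logits)]
    return p_blank, label_probs


def dual_blank_step(iam_blank_logit, hat_blank_logit_fn, label_logits_fn,
                    lam_iam=0.99, lam_hat=0.99):
    """One decoding step with two blank checks (hypothetical ordering).

    hat_blank_logit_fn and label_logits_fn are callables so that the more
    expensive computations only run when an earlier check does not fire.
    Returns (token, stage); token is None for blank, else a label index.
    """
    # Stage 1: IAM blank depends only on the encoder frame.
    if sigmoid(iam_blank_logit) > lam_iam:
        return None, "iam"
    # Stage 2: HAT blank needs the joint network but not the label softmax.
    hat_blank_logit = hat_blank_logit_fn()
    p_blank = sigmoid(hat_blank_logit)
    if p_blank > lam_hat:
        return None, "hat"
    # Stage 3: full non-blank computation.
    _, label_probs = hat_posterior(hat_blank_logit, label_logits_fn())
    best = max(range(len(label_probs)), key=label_probs.__getitem__)
    if label_probs[best] > p_blank:
        return best, "full"
    return None, "full"
```

In this sketch, a frame on which the IAM is already confident about blank never touches the HAT joint network or the label softmax, which is where the decoding speed-up would come from.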

Paper Structure (23 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The architecture of the proposal: HAT with IAM and ILM. Solid, dashed, and dotted arrows show HAT, IAM, and ILM paths, respectively. If we zero out the prediction or encoder network output, it becomes IAM or ILM, respectively.
  • Figure 2: Schematic diagram of (a) autoregressive HAT- and (b) non-autoregressive CTC-blank thresholding approaches. $\lambda^{*}$ is the threshold hyperparameter.
  • Figure 3: WER versus NBP/JCR curves. The lower-left region represents better thresholding and decoding algorithms. System IDs in the legends correspond to those in Tables \ref{tab:summary_offline} and \ref{tab:summary_streaming}.
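
The Figure 1 caption notes that zeroing out the prediction or encoder network output turns the shared model into the IAM or the ILM, respectively. The toy sketch below shows that relationship. The additive `tanh` joint network, the weight matrix, and all function names are assumptions made for illustration; the caption does not specify the paper's joint network.

```python
import math


def joint(enc_out, pred_out, weights):
    """Toy additive joint network (assumed form): W . tanh(enc + pred)."""
    hidden = [math.tanh(e + p) for e, p in zip(enc_out, pred_out)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def hat_logits(enc_out, pred_out, weights):
    # HAT path: both encoder and prediction network outputs are used.
    return joint(enc_out, pred_out, weights)


def iam_logits(enc_out, weights):
    # IAM path: prediction network output is zeroed out.
    return joint(enc_out, [0.0] * len(enc_out), weights)


def ilm_logits(pred_out, weights):
    # ILM path: encoder output is zeroed out.
    return joint([0.0] * len(pred_out), pred_out, weights)


# Hypothetical 3-class output over a 2-dimensional hidden space.
W = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
enc = [0.5, -1.0]
pred = [0.1, 0.4]

print(hat_logits(enc, pred, W))
print(iam_logits(enc, W))
print(ilm_logits(pred, W))
```

All three paths reuse the same `weights`, which mirrors the abstract's statement that the IAM's encoder and joint networks are fully shared with the HAT.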