Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding

Takafumi Moriya; Takanori Ashihara; Masato Mimura; Hiroshi Sato; Kohei Matsuura; Ryo Masumura; Taichi Asami

TL;DR

This paper proposes a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. It also introduces dual blank thresholding, which combines HAT- and IAM-blank thresholding, together with a compatible decoding algorithm.

Abstract

A hybrid autoregressive transducer (HAT) is a variant of the neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. The IAM consists of the encoder and joint networks, which are fully shared and jointly trained with the HAT. This joint training not only improves HAT training efficiency but also encourages the IAM and the HAT to emit blanks synchronously. Synchronous blank emission allows the more expensive non-blank computation to be skipped, which makes blank thresholding more effective and decoding faster. Experiments demonstrate that the relative error reductions of the HAT with IAM compared to the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines HAT- and IAM-blank thresholding, together with a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
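To make the separate blank and non-blank modelling concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard HAT factorization (a sigmoid for the blank probability, and a softmax over labels rescaled by the remaining probability mass). The function names, the convention that index 0 holds the blank logit, and the threshold value of 0.9 are all hypothetical choices for illustration.

```python
import numpy as np


def hat_posteriors(joint_logits):
    """Split joint-network logits into HAT blank and label posteriors.

    Assumption for this sketch: joint_logits[0] is the blank logit and
    joint_logits[1:] are the label logits.
    """
    blank_logit, label_logits = joint_logits[0], joint_logits[1:]
    # Blank is modelled as a Bernoulli variable through a sigmoid.
    p_blank = 1.0 / (1.0 + np.exp(-blank_logit))
    # Labels use a softmax, rescaled by the non-blank probability mass.
    shifted = label_logits - label_logits.max()  # numerical stability
    softmax = np.exp(shifted) / np.exp(shifted).sum()
    p_labels = (1.0 - p_blank) * softmax
    return p_blank, p_labels


def blank_threshold_step(joint_logits, threshold=0.9):
    """Return None when blank is confident enough to skip the label softmax.

    The threshold value and the '>' comparison are illustrative assumptions.
    """
    p_blank = 1.0 / (1.0 + np.exp(-joint_logits[0]))
    if p_blank > threshold:
        return None  # emit blank; the label softmax is never computed
    return hat_posteriors(joint_logits)[1]
```

Because the blank probability depends on a single logit, the check in `blank_threshold_step` is cheap compared with the softmax over the full label vocabulary, which is the computation that thresholding avoids.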

Paper Structure (23 sections, 5 equations, 3 figures, 4 tables)

Introduction
ASR models
Neural transducer models
RNNT
HAT
CTC models
Vanilla CTC
Factorized CTC (FCTC)
Proposed internal acoustic model (IAM) within HAT
Training
Decoding with blank thresholding
HAT-blank thresholding
CTC-blank thresholding
Proposed dual blank thresholding
Compatible decoding algorithm for frame-skipping
...and 8 more sections

Figures (3)

Figure 1: The architecture of the proposal: HAT with IAM and ILM. Solid, dashed, and dotted arrows show HAT, IAM, and ILM paths, respectively. If we zero out the prediction or encoder network output, it becomes IAM or ILM, respectively.
Figure 2: Schematic diagram of (a) autoregressive HAT- and (b) non-autoregressive CTC-blank thresholding approaches. $\lambda^{*}$ is the threshold hyperparameter.
Figure 3: WER versus NBP/JCR curves. The lower-left region represents better thresholding and decoding algorithms. System IDs in the legends correspond to those in Tables \ref{tab:summary_offline} and \ref{tab:summary_streaming}.
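The zeroing-out described in the Figure 1 caption and the idea of combining two blank checks can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's model or decoding algorithm: the additive `tanh` joint network, the random weights, the dimensions, the decision labels, and in particular the order of the two checks (IAM first, then HAT) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
ENC_DIM, PRED_DIM, VOCAB = 4, 3, 5  # toy sizes, chosen for illustration

# Hypothetical joint-network weights; index 0 of the output is the blank logit.
W_enc = rng.normal(size=(VOCAB, ENC_DIM))
W_pred = rng.normal(size=(VOCAB, PRED_DIM))


def joint(enc_out, pred_out):
    """Toy additive joint network shared by the HAT, IAM, and ILM paths."""
    return np.tanh(W_enc @ enc_out + W_pred @ pred_out)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def iam_logits(enc_out):
    """IAM path: the prediction-network output is zeroed out."""
    return joint(enc_out, np.zeros(PRED_DIM))


def ilm_logits(pred_out):
    """ILM path: the encoder output is zeroed out."""
    return joint(np.zeros(ENC_DIM), pred_out)


def dual_blank_decision(enc_out, pred_out, lam_iam, lam_hat):
    """Cascade two blank checks before scoring labels (assumed ordering).

    The IAM check needs only the encoder output, so it can run without
    the prediction network; the HAT check uses both inputs.
    """
    if sigmoid(iam_logits(enc_out)[0]) > lam_iam:
        return "skip_by_iam"
    if sigmoid(joint(enc_out, pred_out)[0]) > lam_hat:
        return "skip_by_hat"
    return "score_labels"
```

For example, `dual_blank_decision(enc, pred, lam_iam=0.95, lam_hat=0.9)` returns one of the three labels for a single frame; a frame is passed on to full label scoring only when neither blank probability clears its threshold.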
