Enhancing CTC-based speech recognition with diverse modeling units
Shiyi Han, Zhihong Lei, Mingbin Xu, Xingyu Na, Zhen Huang
TL;DR
The paper aims to improve CTC-based end-to-end ASR beyond what multi-pass rescoring achieves, by jointly training with diverse modeling units (phonemes, characters, and logographic representations). It introduces a training-time approach that attaches auxiliary CTC losses to intermediate encoder layers, aligning each unit type with the encoder depth that best captures its information. Across LibriSpeech and AISHELL-2, the method yields significant WER/CER improvements without added inference cost or a second-pass acoustic model, and uncovers a consistent pattern in which phonetic units peak in mid-layers while linguistic units benefit from deeper layers. The findings offer a practical path to more accurate and efficient ASR by integrating heterogeneous modeling units directly into the training objective, with demonstrated effectiveness on both alphabetic and logographic languages.
Abstract
In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures such as the Transformer. Building on E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question: where do the improvements come from, other than the system combination effect? We examine the underlying mechanisms driving these gains and propose an efficient joint training approach in which E2E models are trained jointly with diverse modeling units. This methodology not only combines the strengths of phoneme-based and grapheme-based models but also reveals that using these diverse modeling units in a synergistic way can significantly enhance model accuracy. Our findings offer new insights into the optimal integration of heterogeneous modeling units in the development of more robust and accurate ASR systems.
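To make the joint training objective summarized above more concrete, the following is a minimal sketch of a weighted sum of CTC losses over several output heads, for example a phoneme head on a mid-depth encoder layer and a character head on the final layer. It is an illustration under stated assumptions, not the authors' implementation: the function names `ctc_nll` and `joint_ctc_loss`, the auxiliary weight of 0.3, and the toy inputs are all hypothetical, and the CTC forward algorithm is written in plain Python for readability where a real system would use a framework implementation.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood of `target` via the forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of label ids without blanks.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + frame[ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def joint_ctc_loss(heads):
    """Weighted sum of CTC losses; `heads` holds (weight, log_probs, target)."""
    return sum(w * ctc_nll(lp, tgt) for w, lp, tgt in heads)


def uniform_frames(num_frames, vocab_size):
    """Toy posteriors: every symbol equally likely in every frame."""
    return [[math.log(1.0 / vocab_size)] * vocab_size for _ in range(num_frames)]


# Hypothetical setup: a phoneme head on an intermediate layer (assumed weight
# 0.3) and a character head on the final layer (weight 1.0).
phoneme_head = (0.3, uniform_frames(2, 3), [1])
char_head = (1.0, uniform_frames(2, 4), [2])
loss = joint_ctc_loss([phoneme_head, char_head])
```

Because the auxiliary heads contribute only to the training loss, they can be dropped at inference time, which is consistent with the claim that the approach adds no inference cost.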
