Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur
TL;DR
The paper addresses peaky CTC posteriors in forced alignment (FA) by integrating unigram label priors into the training objective and employing a pruned path strategy to favor fewer blanks. It derives the priors-based gradients, implements a lightweight TDNN-FFN aligner, and benchmarks it on Buckeye and TIMIT. Compared with standard CTC and a heuristic baseline for token offsets, the method reduces phoneme and word boundary errors (PBE and WBE) by 12–40% and predicts token offsets as well as onsets more accurately. Against Montreal Forced Aligner (MFA), it performs similarly on Buckeye but falls behind on TIMIT, while offering a simpler training pipeline and faster runtime. The authors release their training recipe and pretrained model in TorchAudio. These findings suggest that label priors can tame CTC's peaky behavior, enabling more reliable FA without complex supervision or large-scale architectures, and open avenues for reusing ASR components in FA contexts.
Abstract
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., the phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offsets of tokens in addition to their onsets. It outperforms the standard CTC model and a heuristic-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit, Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
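To illustrate why label priors boost paths with fewer blanks, the following is a minimal sketch and not the released TorchAudio recipe. It assumes the priors are estimated by averaging frame-level posteriors and are subtracted from the log-posteriors with a scaling factor; the function names, the factor `alpha`, and the toy posteriors are all hypothetical. Because a peaky model assigns blank a high prior, the subtraction penalizes blank more than other labels, so the score gap between a blank-heavy path and a blank-free path for the same token sequence shrinks.

```python
import math

def estimate_log_priors(log_posts):
    """Unigram label priors as the time-average of frame posteriors (assumed estimator)."""
    num_frames = len(log_posts)
    num_labels = len(log_posts[0])
    priors = [
        sum(math.exp(frame[k]) for frame in log_posts) / num_frames
        for k in range(num_labels)
    ]
    return [math.log(p) for p in priors]

def apply_label_priors(log_posts, log_priors, alpha=0.3):
    """Subtract scaled log-priors from every frame; alpha is a hypothetical scale."""
    return [
        [lp - alpha * pr for lp, pr in zip(frame, log_priors)]
        for frame in log_posts
    ]

def path_score(scores, path):
    """Sum of per-frame scores along one frame-level alignment path."""
    return sum(scores[t][label] for t, label in enumerate(path))

# Toy peaky posteriors over 4 frames; label 0 is blank, 1 is "a", 2 is "b".
probs = [
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.1, 0.8],
]
log_posts = [[math.log(p) for p in frame] for frame in probs]

log_priors = estimate_log_priors(log_posts)
adjusted = apply_label_priors(log_posts, log_priors)

# Two paths that both collapse to the token sequence "a b".
blank_heavy = [1, 0, 0, 2]  # a, blank, blank, b
blank_free = [1, 1, 2, 2]   # a, a, b, b

gap_before = path_score(log_posts, blank_heavy) - path_score(log_posts, blank_free)
gap_after = path_score(adjusted, blank_heavy) - path_score(adjusted, blank_free)
print(f"score gap without priors: {gap_before:.3f}")
print(f"score gap with priors:    {gap_after:.3f}")
```

In this toy case the blank-heavy path still scores higher after the adjustment, but by a smaller margin. Under the assumptions above, training against the adjusted scores would shift more probability mass toward paths in which tokens occupy more frames, which is the less peaky behavior the abstract describes.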
