Confidence intervals for forced alignment boundaries using model ensembles
Matthew C. Kelley
TL;DR
The paper tackles the lack of uncertainty quantification in forced-alignment boundaries by introducing a neural network ensemble approach (MAPS) to estimate boundaries and derive confidence intervals. The boundary for each segment is set at the ensemble median, and a nonparametric 97.85% confidence interval is constructed from the 2nd and 9th order statistics of the 10 models' boundary estimates. Empirically, the ensemble boundaries show small improvements over single-model baselines on Buckeye and TIMIT. The paper also analyzes CI widths and their relationship to boundary types, using dynamic time warping to compare mismatched transcriptions. Outputs include JSON and Praat TextGrid formats to support programmatic use, and the paper discusses practical trade-offs and future directions such as Bayesian uncertainty and faster computation.
Abstract
Forced alignment is a common tool to align audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single estimate of a boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. The alignment ensemble is then used to place the boundary at the median of the boundaries in the ensemble, and 97.85% confidence intervals are constructed using order statistics. Having confidence intervals provides an estimate of the uncertainty in the boundary placement, facilitating tasks like finding boundaries that should be reviewed. As a bonus, on the Buckeye and TIMIT corpora, the ensemble boundaries show a slight overall improvement over using just a single model. The confidence intervals can be emitted during the alignment process as JSON files and a main table for programmatic and statistical analysis. For familiarity, they are also output as Praat TextGrids using a point tier to represent the intervals.
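The median-and-order-statistics step can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the example boundary times, and the choice of the 2nd-smallest and 2nd-largest of ten estimates as interval endpoints are assumptions made for illustration. The coverage calculation additionally assumes the ten estimates behave like independent draws from a continuous distribution, which is the standard setting for a distribution-free interval on a median.

```python
from math import comb
from statistics import median


def ensemble_boundary(times, lo_rank=2, hi_rank=9):
    """Return (median, lower, upper) for one boundary.

    `times` holds one boundary estimate per ensemble member, in seconds.
    `lo_rank` and `hi_rank` are 1-based order-statistic ranks (assumed here
    to be 2 and 9 for a ten-model ensemble).
    """
    ordered = sorted(times)
    if not 1 <= lo_rank < hi_rank <= len(ordered):
        raise ValueError("ranks must satisfy 1 <= lo_rank < hi_rank <= n")
    return median(ordered), ordered[lo_rank - 1], ordered[hi_rank - 1]


def median_ci_coverage(n, lo_rank, hi_rank):
    """Coverage of [X_(lo_rank), X_(hi_rank)] for the population median.

    The interval contains the median exactly when the number of samples
    falling below the median lies in lo_rank .. hi_rank - 1, and that count
    is Binomial(n, 1/2) under the i.i.d. continuous assumption.
    """
    return sum(comb(n, k) for k in range(lo_rank, hi_rank)) / 2 ** n


# Hypothetical boundary times (seconds) from ten aligner runs.
times = [0.512, 0.498, 0.505, 0.501, 0.530, 0.503, 0.499, 0.507, 0.504, 0.470]
boundary, lower, upper = ensemble_boundary(times)
coverage = median_ci_coverage(10, 2, 9)
print(f"boundary={boundary:.4f}s, CI=[{lower:.3f}, {upper:.3f}], coverage={coverage:.4%}")
```

With ten estimates, the coverage evaluates to 1002/1024, or about 0.9785, which matches the confidence level quoted above. Dropping only the single most extreme estimate on each side also makes the interval insensitive to one outlying model per side.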
