Confidence intervals for forced alignment boundaries using model ensembles

Matthew C. Kelley

TL;DR

The paper addresses the lack of uncertainty quantification for forced-alignment boundaries by using a neural network ensemble approach (MAPS) to estimate boundaries and derive confidence intervals. Each boundary is placed at the median of the 10 models' estimates, and a nonparametric $97.85\%$ confidence interval is taken from the 2nd and 9th order statistics of those estimates. On Buckeye and TIMIT, the ensemble boundaries show a slight overall improvement over a single model. The paper also analyzes confidence interval widths by boundary type and uses dynamic time warping to compare transcriptions whose boundary counts differ. Intervals are output as JSON files, a main table, and Praat TextGrids, and the paper discusses practical trade-offs and future directions such as Bayesian uncertainty and faster computation.

Abstract

Forced alignment is a common tool to align audio with orthographic and phonetic transcriptions. Most forced alignment tools provide only a single estimate of a boundary. The present project introduces a method of deriving confidence intervals for these boundaries using a neural network ensemble technique. Ten different segment classifier neural networks were previously trained, and the alignment process is repeated with each model. The alignment ensemble is then used to place the boundary at the median of the boundaries in the ensemble, and 97.85% confidence intervals are constructed using order statistics. Having confidence intervals provides an estimate of the uncertainty in the boundary placement, facilitating tasks like finding boundaries that should be reviewed. As a bonus, on the Buckeye and TIMIT corpora, the ensemble boundaries show a slight overall improvement over using just a single model. The confidence intervals can be emitted during the alignment process as JSON files and a main table for programmatic and statistical analysis. For familiarity, they are also output as Praat TextGrids using a point tier to represent the intervals.
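
The median placement and the 97.85% figure can be illustrated with a minimal Python sketch. It is not the paper's code: the function names and the sample boundary times are hypothetical, and the coverage calculation assumes the ten boundary estimates behave like independent draws from a continuous distribution, so that the count of estimates falling below the true median is Binomial(10, 1/2).

```python
from math import comb
from statistics import median


def order_stat_coverage(n: int, lo: int, hi: int) -> float:
    """Coverage of [X_(lo), X_(hi)] (1-indexed) for the population median.

    Assumes n i.i.d. continuous samples. The interval contains the median
    exactly when between lo and hi - 1 samples fall below it.
    """
    return sum(comb(n, k) for k in range(lo, hi)) / 2 ** n


def boundary_with_ci(estimates, lo=2, hi=9):
    """Return (median, lower, upper) from one boundary's ensemble estimates."""
    ordered = sorted(estimates)
    # Order statistics are 1-indexed, Python lists are 0-indexed.
    return median(ordered), ordered[lo - 1], ordered[hi - 1]


# Hypothetical boundary times (seconds) from ten aligner runs.
estimates = [0.512, 0.498, 0.505, 0.520, 0.501,
             0.507, 0.495, 0.510, 0.503, 0.530]

point, lower, upper = boundary_with_ci(estimates)
coverage = order_stat_coverage(10, 2, 9)
print(f"boundary={point:.3f}s, CI=[{lower:.3f}, {upper:.3f}]")
print(f"coverage={coverage:.4%}")
```

Under these assumptions the coverage is 1 - 22/1024 = 1002/1024, which is 97.85% to two decimal places and matches the level stated in the abstract. Dropping only the most extreme estimate on each side is what keeps the interval nonparametric: no distributional shape is assumed for the boundary times.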

Paper Structure

This paper contains 9 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Alignment between segmentations that differ in the number of boundaries for the sentence "It suffers from a lack of unity of purpose and respect for heroic leadership" from speaker FADG0 in the TIMIT speech corpus. Notice how the upper transcription has an extra [d] segment that the bottom transcription does not. Dashed red lines indicate the boundaries that would be aligned for comparison by dynamic time warping. Differences in segment labels do not matter for the evaluation.
  • Figure 2: Heatmaps for central tendencies of the confidence interval widths by bisegment type. The y-axes indicate the first category in a bisegment pair, while the x-axes indicate the second category in a bisegment pair. White cells indicate unobserved data.
  • Figure 3: Sample of TextGrid format for segmentation of ...blical scholars arg... from sentence "SX42" from speaker "FAEM0" in the TIMIT corpus. In the present figure, the tiers of the TextGrid have been separated for easier viewing. In the standard output from the model, all three tiers are part of the same TextGrid. The segments were determined via dictionary lookup.
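
The boundary matching described in the Figure 1 caption can be sketched with textbook dynamic time warping over boundary times. This is only an illustration under assumptions: the absolute time difference as local cost, the three-way step pattern, the function name `dtw_align`, and the example times are all hypothetical, and the paper's exact configuration may differ.

```python
def dtw_align(ref, hyp):
    """Align two sorted lists of boundary times with classic DTW.

    Returns (ref_index, hyp_index) pairs. Segment labels are ignored;
    only boundary times enter the cost.
    """
    n, m = len(ref), len(hyp)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of ref[:i] with hyp[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - hyp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    pairs.reverse()
    return pairs


# Hypothetical boundary times (seconds); the reference has one extra boundary.
ref = [0.10, 0.25, 0.30, 0.50]
hyp = [0.11, 0.27, 0.52]
for r, h in dtw_align(ref, hyp):
    print(f"ref {ref[r]:.2f}s <-> hyp {hyp[h]:.2f}s")
```

In this toy case the extra reference boundary at 0.30 s is matched to the same hypothesis boundary as its neighbor at 0.25 s, so every boundary still receives a comparison partner even though the two segmentations differ in length.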