The Mason-Alberta Phonetic Segmenter: A forced alignment system based on deep neural networks and interpolation

Matthew C. Kelley; Scott James Perry; Benjamin V. Tucker

TL;DR

This paper introduces the Mason-Alberta Phonetic Segmenter (MAPS), a neural network-based forced alignment system, and investigates two enhancements: recasting the acoustic model as a tagging (multi-label) problem, and interpolating boundaries to go beyond the standard 10 ms granularity. In experiments on TIMIT and Buckeye, MAPS is benchmarked against the Montreal Forced Aligner. Interpolation yields a 27.92% relative increase in the number of test-set boundaries within 10 ms of the target, while the tagging approach does not consistently improve alignment. The study also critically discusses how acoustic targets and phonetic similarity are represented in training, highlighting a tension between classifying segments accurately and reflecting the similarity between phones, which may require new representations or multi-tier transcriptions to resolve. The findings suggest that boundary interpolation is a practical gain with current architectures, but that robust, generalizable forced alignment may require rethinking output targets, transcription formats, and evaluation metrics to better capture phonetic reality.

Abstract

Forced alignment systems automatically determine boundaries between segments in speech data, given an orthographic transcription. These tools are commonplace in phonetics to facilitate the use of speech data that would be infeasible to manually transcribe and segment. In the present paper, we describe a new neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). The MAPS aligner serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model in a forced aligner as a tagging task, rather than a classification task, motivated by the common understanding that segments in speech are not truly discrete and commonly overlap. The second is an interpolation technique to allow boundaries more precise than the common 10 ms limit in modern forced alignment systems. We compare configurations of our system to a state-of-the-art system, the Montreal Forced Aligner. The tagging approach did not generally yield improved results over the Montreal Forced Aligner. However, a system with the interpolation technique had a 27.92% increase relative to the Montreal Forced Aligner in the number of boundaries within 10 ms of the target on the test set. We also reflect on the task and training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones, and that reconciling this tension may require rethinking the task and output targets or how speech itself should be segmented.
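The contrast between classification and tagging can be illustrated with a minimal sketch. It assumes that the classifier uses a softmax output layer (as in Figure 2) and that the tagger uses independent sigmoid outputs, which is a common way to set up a multi-label task but is not confirmed here as the exact MAPS configuration. The phone labels and logit values below are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that compete and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Convert one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logits for a single 10 ms frame (illustrative values only).
phones = ["k", "g", "t"]
logits = [2.0, 1.5, -1.0]

# Classification: the labels compete for a fixed probability mass of 1.
class_probs = softmax(logits)

# Tagging: each label is scored on its own, so two acoustically similar
# phones can both receive a high score for the same frame.
tag_probs = [sigmoid(z) for z in logits]

for phone, c, t in zip(phones, class_probs, tag_probs):
    print(f"[{phone}] classification={c:.3f} tagging={t:.3f}")
```

Under softmax, raising the score of one phone necessarily lowers the others. Under the sigmoid setup, both [k] and [g] can score above 0.5 at once, which is closer to the idea that segments overlap and resemble one another.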

Paper Structure (28 sections, 13 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Related work
The present paper
Theoretical analysis
Segment classification
An ideal segment classifier
Training classifiers in practice
A potential solution: Tagging
Boundary placement precision
Empirical analysis
Training data
Model architecture
Model training routine
Decoding the network output
Montreal Forced Aligner
...and 13 more sections

Figures (6)

Figure 1: Flowchart diagram of forced alignment process. The (a) and (b) sections represent parallel streams that do not depend on each other. The outputs of (a) and (b) are then merged in (c) with the decoding process, which yields an alignment that can be displayed with a spectrogram and/or waveform.
Figure 2: Schematic state diagram of neural network with softmax activation and categorical cross-entropy loss when presented with exemplar labeled as [k]. The initial state of the network is shown in (a). In (b), a gradient update for vanilla stochastic gradient descent is shown for when the value of the "Velar" variable was 2. The numbers are derived using the gradient rules from Equations \ref{eq:cce_corr} and \ref{eq:cce_inc}, in addition to basic partial derivatives of products. In (c), the state of the network after the update is shown, assuming $\alpha=0.1$ for the learning rate. Note how the connection strength between the velar pinch and [g] has been weakened, which is undesirable.
Figure 3: By-epoch evaluation metrics for crisp network. Bands corresponding to a 95% confidence interval are given in the shading surrounding the line plots. Note that the bands are present for the training metrics, but they are very tight.
Figure 4: By-epoch evaluation metrics for tagger network. Bands corresponding to a 95% confidence interval are given in the shading surrounding the line plots. Note that the bands are present for the training metrics, but they are very tight.
Figure 5: Cumulative density function for crisp networks and Montreal Forced Aligner. The line labeled "Interp" is the result of the system using interpolation. The line labeled "No interp" is the result of the system not using interpolation. The line labeled "MFA" is the result of the Montreal Forced Aligner.
...and 1 more figure
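The update described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the figure: the initial weights and the three-phone label set are hypothetical, and the gradient rule used is the standard one for softmax with categorical cross-entropy (the derivative with respect to each logit is the predicted probability minus the one-hot target), which is assumed to correspond to the equations the caption references. Only the input value of 2 for "Velar" and the learning rate of 0.1 come from the caption.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


ALPHA = 0.1   # learning rate, as stated in the caption
velar = 2.0   # value of the "Velar" input, as stated in the caption
target = "k"  # the exemplar is labeled [k]

# Hypothetical initial weights from the "Velar" input to each output unit.
labels = ["k", "g", "t"]
weights = {"k": 0.5, "g": 0.5, "t": 0.0}

# Forward pass: one input feature, so each logit is weight * input.
probs = softmax([weights[label] * velar for label in labels])

# Backward pass for softmax + categorical cross-entropy:
#   dL/dz_i = p_i - y_i, and by the product rule dL/dw_i = (p_i - y_i) * x.
new_weights = {}
for label, p in zip(labels, probs):
    y = 1.0 if label == target else 0.0
    grad = (p - y) * velar
    new_weights[label] = weights[label] - ALPHA * grad  # vanilla SGD step

for label in labels:
    print(f"[{label}] {weights[label]:.4f} -> {new_weights[label]:.4f}")
```

Because every non-target unit has a positive gradient whenever the input is positive, the weight from the velar cue to [g] shrinks even though [g] is also velar. This is the undesirable weakening that the caption points out.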
