Articulatory strategy in vowel production as a basis for speaker discrimination
Justin J. H. Lo, Patrycja Strycharczuk, Sam Kirkham
TL;DR
The study investigates whether articulatory strategy in vowel production is sufficiently speaker-specific for discrimination by analysing midsagittal tongue shapes from 40 English speakers using Generalised Procrustes Analysis and tangent-space PCA. It contrasts size-and-shape features with shape-only features and evaluates their speaker-discriminatory power through likelihood-ratio testing, reporting metrics such as the equal error rate ($EER$) and the log-likelihood-ratio cost ($C_{llr}$). The results show tongue size (size-and-shape PC1) to be the strongest individual discriminator, with anterior tongue dorsum curvature (shape PC3) also exhibiting notable individuality. Combinations of shape features can approach the performance of size-and-shape features, but only when the combined PCs do not co-vary at the speaker level. These findings support a holistic view of speaker discrimination that integrates anatomy and articulatory strategy, and point to future work linking articulatory variation to its acoustic correlates for a fuller phonetic model of speaker identity.
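For readers unfamiliar with the two metrics, the following is a minimal sketch of how $C_{llr}$ and $EER$ are commonly computed from likelihood ratios obtained in same-speaker and different-speaker comparisons. The function names `cllr` and `eer` are hypothetical, and the code is not taken from the study; it only illustrates the standard definitions.

```python
import numpy as np

def cllr(lr_same, lr_diff):
    """Log-likelihood-ratio cost from same- and different-speaker LRs."""
    lr_same = np.asarray(lr_same, dtype=float)
    lr_diff = np.asarray(lr_diff, dtype=float)
    # Same-speaker comparisons are penalised when the LR is small.
    penalty_same = np.mean(np.log2(1.0 + 1.0 / lr_same))
    # Different-speaker comparisons are penalised when the LR is large.
    penalty_diff = np.mean(np.log2(1.0 + lr_diff))
    return 0.5 * (penalty_same + penalty_diff)

def eer(lr_same, lr_diff):
    """Approximate equal error rate via a sweep over observed thresholds."""
    log_same = np.log(np.asarray(lr_same, dtype=float))
    log_diff = np.log(np.asarray(lr_diff, dtype=float))
    thresholds = np.sort(np.concatenate([log_same, log_diff]))
    # Miss rate: same-speaker pairs scoring below the threshold.
    miss = np.array([np.mean(log_same < t) for t in thresholds])
    # False-alarm rate: different-speaker pairs scoring at or above it.
    false_alarm = np.array([np.mean(log_diff >= t) for t in thresholds])
    # Take the threshold where the two error rates are closest.
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])
```

A system whose likelihood ratios are all equal to 1 carries no information and yields $C_{llr} = 1$; values closer to 0 indicate better-calibrated and more discriminating output.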
Abstract
The way speakers articulate is well known to be variable across individuals while at the same time subject to anatomical and biomechanical constraints. In this study, we ask whether articulatory strategy in vowel production can be sufficiently speaker-specific to form the basis for speaker discrimination. We conducted Generalised Procrustes Analyses of tongue shape data from 40 English speakers from the North West of England, and assessed the speaker-discriminatory potential of orthogonal tongue shape features within the framework of likelihood ratios. Tongue size emerged as the individual dimension with the strongest discriminatory power, while tongue shape variation in the more anterior part of the tongue generally outperformed tongue shape variation in the posterior part. When considered in combination, shape-only information may offer comparable levels of speaker specificity to size-and-shape information, but only when features do not exhibit speaker-level co-variation.
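The distinction between shape-only and size-and-shape features can be sketched in code. The example below is an illustrative sketch only, assuming each tongue contour is a set of corresponding 2-D midsagittal points; the function names (`align_to`, `gpa`, `tangent_pca`) and the `scale` flag are hypothetical, and the PCA step uses residuals from the mean configuration as a linear approximation to the tangent space rather than the study's exact procedure.

```python
import numpy as np

def align_to(ref, shape):
    """Rotate a centred `shape` onto `ref` (orthogonal Procrustes, no reflection)."""
    u, _, vt = np.linalg.svd(shape.T @ ref)
    if np.linalg.det(u @ vt) < 0:
        u[:, -1] *= -1.0  # flip one axis so the result is a proper rotation
    return shape @ (u @ vt)

def gpa(configs, scale=True, n_iter=10):
    """Generalised Procrustes alignment of an (n, k, 2) array of contours.

    scale=True removes size (shape-only);
    scale=False keeps size (size-and-shape).
    """
    x = np.asarray(configs, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)  # remove translation
    if scale:
        # Normalise each contour to unit centroid size.
        x = x / np.linalg.norm(x, axis=(1, 2), keepdims=True)
    mean = x[0]
    for _ in range(n_iter):
        x = np.stack([align_to(mean, s) for s in x])  # remove rotation
        mean = x.mean(axis=0)
        if scale:
            mean = mean / np.linalg.norm(mean)
    return x, mean

def tangent_pca(aligned, mean):
    """PCA of residuals from the mean configuration (tangent-space approximation)."""
    resid = (aligned - mean).reshape(len(aligned), -1)
    resid = resid - resid.mean(axis=0)
    _, s, vt = np.linalg.svd(resid, full_matrices=False)
    scores = resid @ vt.T  # per-contour PC scores
    total = np.sum(s ** 2)
    explained = s ** 2 / total if total > 0 else np.zeros_like(s)
    return scores, vt, explained
```

With `scale=False`, contours that differ only in overall size load on a single component, which matches the intuition behind tongue size emerging as one dimension of the size-and-shape analysis; with `scale=True`, that variation is removed before PCA.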
