Robust Nasality Representation Learning for Cleft Palate-Related Velopharyngeal Dysfunction Screening in Real-World Settings

Weixin Liu, Bowen Qu, Amy Stone, Maria E. Powell, Shama Dufresne, Stephane Braun, Izabela Galdyn, Michael Golinko, Bradley Malin, Zhijun Yin, Matthew E. Pontell

Abstract

Velopharyngeal dysfunction (VPD) is characterized by inadequate velopharyngeal closure during speech and often causes hypernasality and reduced intelligibility. Although speech-based machine learning models can perform well under standardized clinical recording conditions, their performance often drops in real-world settings because of domain shift caused by differences in devices, channels, noise, and room acoustics. To improve robustness, we propose a two-stage framework for VPD screening. First, a nasality-focused speech representation is learned by supervised contrastive pre-training on an auxiliary corpus with phoneme alignments, using oral-context versus nasal-context supervision. Second, the encoder is frozen and used with lightweight classifiers on 0.5-second speech chunks, whose probabilities are aggregated to produce recording-level decisions with a fixed threshold. On an in-domain clinical cohort of 82 subjects, the proposed method achieved perfect recording-level screening performance (macro-F1 = 1.000, accuracy = 1.000). On a separate out-of-domain set of 131 heterogeneous public Internet recordings, large pretrained speech representations degraded substantially, while MFCC was the strongest baseline (macro-F1 = 0.612, accuracy = 0.641). The proposed method achieved the best out-of-domain performance (macro-F1 = 0.679, accuracy = 0.695), improving on the strongest baseline under the same evaluation protocol. These results suggest that learning a nasality-focused representation before clinical classification can reduce sensitivity to recording artifacts and improve robustness for deployable speech-based VPD screening.
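
As an illustration of the second stage described above, the following minimal sketch splits a recording into 0.5-second chunks, scores each chunk, and averages the chunk probabilities into a recording-level decision. It is not the authors' implementation. The names `split_into_chunks`, `screen_recording`, and `chunk_prob_fn` are hypothetical, the threshold value of 0.5 is an assumption (the abstract only states that the threshold is fixed), and dropping a trailing partial chunk is likewise an assumption.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def split_into_chunks(
    samples: Sequence[float], sample_rate: int, chunk_seconds: float = 0.5
) -> List[Sequence[float]]:
    """Cut a waveform into non-overlapping fixed-length chunks.

    Assumption: a trailing partial chunk is discarded.
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    last_start = len(samples) - chunk_len
    return [samples[i:i + chunk_len] for i in range(0, last_start + 1, chunk_len)]


def screen_recording(
    samples: Sequence[float],
    sample_rate: int,
    chunk_prob_fn: Callable[[Sequence[float]], float],
    threshold: float = 0.5,  # assumed value; the source only says "fixed"
) -> Tuple[float, bool]:
    """Return (recording-level probability, positive screening decision).

    chunk_prob_fn stands in for the frozen encoder plus a lightweight
    classifier and must return a probability in [0, 1] for one chunk.
    """
    chunks = split_into_chunks(samples, sample_rate)
    if not chunks:
        raise ValueError("recording is shorter than one chunk")
    chunk_probs = [chunk_prob_fn(chunk) for chunk in chunks]
    recording_prob = mean(chunk_probs)  # mean aggregation over chunks
    return recording_prob, recording_prob >= threshold
```

Because the decision depends only on the mean of the chunk probabilities, a few chunks corrupted by noise shift the recording-level score less than they would under a max or any-chunk rule.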

Paper Structure (24 sections, 10 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Overview of the proposed two-stage framework. Stage 1: nasality representation pre-training using supervised contrastive learning (SupCon) with positive pairs sampled from the same speaker, same vowel, and same auxiliary class (oral-context vs. nasal-context), while restricting contrastive comparisons to within-vowel pairs to reduce phonetic-content leakage. A Wav2Vec2.0 backbone with trainable layer fusion and a projection head outputs 256-dimensional $\ell_2$-normalized embeddings. Stage 2: robust VPD screening under domain shift (lab $\rightarrow$ wild) using the frozen encoder as a feature extractor on 0.5 s chunks, followed by a lightweight classifier (LR/SVM/MLP/XGBoost) and mean aggregation of chunk-level probabilities to produce recording-level screening decisions (in-domain and out-of-domain) using a fixed decision threshold.
  • Figure 2: UMAP visualization of SupCon nasality embeddings on the auxiliary validation split. Each panel shows a single vowel with a class-balanced subset of vowel-centered segments (0.20 s). Points are colored by the auxiliary nasality context label (oral_core vs. nasal_strong). Vowel labels follow ARPAbet notation from the forced-alignment annotations.
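
The positive-pair rule in the Figure 1 caption can be sketched as a supervised contrastive loss in which both the positives and the comparison set are limited to segments of the same vowel. This is one reading of the caption, not the authors' code: the function name `supcon_within_vowel`, the `(speaker, vowel, context)` label layout, and the temperature of 0.1 are assumptions, and the standard SupCon form (log-softmax averaged over positives) is used because the caption does not give the exact equation.

```python
import math
from typing import List, Sequence, Tuple

Label = Tuple[str, str, str]  # (speaker, vowel, context), hypothetical layout


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def supcon_within_vowel(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[Label],
    temperature: float = 0.1,  # assumed value, not given in the source
) -> float:
    """Mean SupCon loss over anchors that have at least one positive.

    Positives share speaker, vowel, and context with the anchor.
    The softmax denominator covers only other same-vowel samples.
    """
    z = [l2_normalize(v) for v in embeddings]
    losses = []
    for i, label_i in enumerate(labels):
        vowel_i = label_i[1]
        # Comparison set: every other sample of the same vowel.
        candidates = [
            j for j, label_j in enumerate(labels)
            if j != i and label_j[1] == vowel_i
        ]
        positives = [j for j in candidates if labels[j] == label_i]
        if not positives:
            continue  # anchor contributes nothing without a positive
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in candidates
        }
        # Numerically stable log-sum-exp over the comparison set.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(s - peak) for s in logits.values())
        )
        log_probs = [logits[j] - log_denom for j in positives]
        losses.append(-sum(log_probs) / len(positives))
    if not losses:
        raise ValueError("no anchor has a positive pair")
    return sum(losses) / len(losses)
```

Under this reading, a segment of a different vowel never appears in another anchor's denominator, so the loss cannot be lowered simply by separating vowels from one another, which is the phonetic-content leakage the caption refers to.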