MVP: Multi-source Voice Pathology detection
Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis
TL;DR
The paper tackles non-invasive automated voice pathology detection under cross-source variability by introducing MVP, a multi-source framework that processes raw audio from sustained vowels and sentence readings using transformer backbones. It systematically compares fusion strategies, with intermediate feature fusion (IFF) and a Transformer Encoder (TE) delivering the best cross-source integration. Key contributions include a detailed multi-source architecture, rigorous ablations on backbones and feature layers, and a cross-language evaluation (German, Portuguese, Italian) showing up to a 13% AUC improvement over single-source methods. The work demonstrates that leveraging complementary information from multiple speaking tasks enhances robustness and diagnostic performance for voice pathologies in diverse recording conditions.
Abstract
Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across German, Portuguese, and Italian shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves an AUC improvement of up to 13% over single-source methods.
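
The three fusion strategies named in the abstract differ in where the two recordings are combined: at the input, at the feature level, or at the output. The sketch below is only an illustration of that distinction, not the authors' implementation. It assumes a toy frame-wise encoder (a fixed random projection) in place of a pretrained transformer backbone and a mean-pooled logistic head in place of the learned fusion module; `encode`, `classify`, `FRAME`, and `DIM` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 400  # hypothetical frame length in samples
DIM = 8      # hypothetical feature dimension

# Stand-ins for learned weights (assumption: random, untrained).
W_ENC = rng.standard_normal((FRAME, DIM)) / np.sqrt(FRAME)
W_CLS = rng.standard_normal(DIM)


def encode(wave):
    """Toy stand-in for a transformer backbone: waveform -> (num_frames, DIM)."""
    n = len(wave) // FRAME
    frames = wave[: n * FRAME].reshape(n, FRAME)
    return np.tanh(frames @ W_ENC)


def classify(features):
    """Mean-pool over time, then a logistic head -> probability of pathology."""
    logit = features.mean(axis=0) @ W_CLS
    return float(1.0 / (1.0 + np.exp(-logit)))


def waveform_concatenation(vowel, sentence):
    # Fuse at the input: join the raw signals, then run a single model.
    return classify(encode(np.concatenate([vowel, sentence])))


def intermediate_feature_fusion(vowel, sentence):
    # Fuse at the feature level: encode each source, join the feature sequences.
    fused = np.concatenate([encode(vowel), encode(sentence)], axis=0)
    return classify(fused)


def decision_level_combination(vowel, sentence):
    # Fuse at the output: score each source separately, then average.
    return 0.5 * (classify(encode(vowel)) + classify(encode(sentence)))


# Synthetic signals standing in for one speaker's two recordings.
vowel = rng.standard_normal(16000)
sentence = rng.standard_normal(48000)
scores = {
    "waveform": waveform_concatenation(vowel, sentence),
    "feature": intermediate_feature_fusion(vowel, sentence),
    "decision": decision_level_combination(vowel, sentence),
}
```

Because the toy encoder treats every frame independently, the first two strategies nearly coincide here. With a real transformer backbone, which attends across frames, and a learned fusion module over the joined features, they would generally produce different representations.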
