MVP: Multi-source Voice Pathology detection

Alkis Koudounas, Moreno La Quatra, Gabriele Ciravegna, Marco Fantini, Erika Crosetti, Giovanni Succo, Tania Cerquitelli, Sabato Marco Siniscalchi, Elena Baralis

TL;DR

The paper tackles non-invasive automated voice pathology detection under cross-source variability by introducing MVP, a multi-source framework that processes raw audio from sustained vowels and sentence readings using transformer backbones. It systematically compares three fusion strategies and finds that intermediate feature fusion (IFF) with a Transformer Encoder (TE) integrates the two sources best. Key contributions include a detailed multi-source architecture, ablations on backbones and feature layers, and a cross-language evaluation (German, Portuguese, Italian) showing up to a 13% AUC improvement over single-source methods. The work demonstrates that combining complementary information from multiple speaking tasks improves robustness and diagnostic performance across diverse recording conditions.

Abstract

Voice disorders significantly impact patient quality of life, yet non-invasive automated diagnosis remains under-explored due to both the scarcity of pathological voice data and the variability in recording sources. This work introduces MVP (Multi-source Voice Pathology detection), a novel approach that leverages transformers operating directly on raw voice signals. We explore three fusion strategies to combine sentence reading and sustained vowel recordings: waveform concatenation, intermediate feature fusion, and decision-level combination. Empirical validation across German, Portuguese, and Italian shows that intermediate feature fusion using transformers best captures the complementary characteristics of both recording types. Our approach achieves up to +13% AUC improvement over single-source methods.

Paper Structure

This paper contains 14 sections, 10 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Proposed MVP framework: waveform concatenation (top panel), intermediate feature fusion (mid panel), and decision-level combination (bottom panel).
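
The three panels of Figure 1 differ only in where the two recordings are merged. The sketch below illustrates those three merge points with toy code; it is not the authors' implementation. `toy_encoder` and `toy_classifier` are hypothetical stand-ins for the transformer backbone and the classification head, plain concatenation of feature sequences stands in for the Transformer Encoder fusion module, and averaging the two scores is an assumed decision rule.

```python
import math
from statistics import fmean


def toy_encoder(waveform, frame=4):
    """Hypothetical stand-in for a transformer backbone.

    Mean-pools fixed-size frames of the raw waveform into a feature sequence.
    """
    return [fmean(waveform[i:i + frame]) for i in range(0, len(waveform), frame)]


def toy_classifier(features):
    """Hypothetical stand-in head: logistic score of the mean feature."""
    return 1.0 / (1.0 + math.exp(-fmean(features)))


def waveform_concatenation(vowel, sentence):
    # Merge at the input: join the raw signals, then encode once.
    return toy_classifier(toy_encoder(vowel + sentence))


def intermediate_feature_fusion(vowel, sentence):
    # Merge in the middle: encode each source separately, then fuse features.
    # Assumption: simple sequence concatenation replaces the learned fusion module.
    fused = toy_encoder(vowel) + toy_encoder(sentence)
    return toy_classifier(fused)


def decision_level_combination(vowel, sentence):
    # Merge at the output: score each source separately, then combine scores.
    # Assumption: unweighted average of the two scores.
    p_vowel = toy_classifier(toy_encoder(vowel))
    p_sentence = toy_classifier(toy_encoder(sentence))
    return (p_vowel + p_sentence) / 2.0


if __name__ == "__main__":
    vowel = [0.1] * 8       # toy "sustained vowel" samples
    sentence = [0.3] * 12   # toy "sentence reading" samples
    for strategy in (waveform_concatenation,
                     intermediate_feature_fusion,
                     decision_level_combination):
        print(strategy.__name__, round(strategy(vowel, sentence), 4))
```

With a real backbone the three strategies would behave very differently, since a learned fusion module can model interactions between the two feature sequences; the toy functions only make the position of the merge explicit.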