A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice
Mary Paterson, James Moor, Luisa Cutillo
TL;DR
This paper tackles the reproducibility gap in voice-based laryngeal cancer detection by introducing a public benchmark of 36 models trained on the FEMH and SVD datasets, with open code and standardized evaluation. It systematically compares three audio feature sets and three classifiers across four input configurations (audio alone, audio with demographics, audio with symptoms, and audio with both), and it reports fairness and inference-time metrics alongside classification performance. The strongest result comes from a logistic regression model using OpenSMILE features on the full input set (audio+demographics+symptoms), which achieves balanced accuracy $\approx 0.837$, sensitivity $\approx 0.840$, specificity $\approx 0.833$, and AUROC $\approx 0.918$ on FEMH, with solid generalization to SVD. Fairness concerns and dataset imbalances nonetheless remain. The study demonstrates that simpler ML approaches with robust feature extraction can outperform some deep learning methods in this domain, provides a reproducible baseline for future work, and highlights practical considerations for clinical deployment, such as potential demographic biases and the need for broader external validation.
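As a rough illustration of how the count of 36 models arises, the sketch below enumerates a grid of three feature sets, three classifiers, and four input configurations. Only OpenSMILE and logistic regression are named in the text above; the remaining entries (`feature_set_b`, `classifier_b`, and so on) are placeholder names assumed here for illustration, and the function is not taken from the authors' repository.

```python
from itertools import product

# Only "opensmile" and "logistic_regression" are named in the summary;
# every other name below is a hypothetical placeholder.
FEATURE_SETS = ["opensmile", "feature_set_b", "feature_set_c"]
CLASSIFIERS = ["logistic_regression", "classifier_b", "classifier_c"]

# Each input configuration maps to the extra modalities joined to the audio features.
INPUT_CONFIGS = {
    "audio": (),
    "audio+demographics": ("demographics",),
    "audio+symptoms": ("symptoms",),
    "audio+demographics+symptoms": ("demographics", "symptoms"),
}


def benchmark_grid():
    """Return one dict per (feature set, classifier, input configuration) combination."""
    return [
        {
            "features": features,
            "classifier": classifier,
            "inputs": inputs,
            "extra_modalities": INPUT_CONFIGS[inputs],
        }
        for features, classifier, inputs in product(
            FEATURE_SETS, CLASSIFIERS, INPUT_CONFIGS
        )
    ]


grid = benchmark_grid()
print(len(grid))  # 3 feature sets * 3 classifiers * 4 input configurations = 36
```

Under these assumptions, each entry of `grid` identifies one model in the benchmark, and the best-performing configuration reported above corresponds to `opensmile`, `logistic_regression`, and `audio+demographics+symptoms`.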
Abstract
Cases of laryngeal cancer are predicted to rise significantly in the coming years. Current diagnostic pathways are inefficient, putting undue stress on both patients and the medical system. Artificial intelligence offers a promising solution by enabling non-invasive detection of laryngeal cancer from patient voice, which could help prioritise referrals more effectively. A major barrier in this field is the lack of reproducible methods. Our work addresses this challenge by introducing a benchmark suite comprising 36 models trained and evaluated on open-source datasets. These models distinguish between patients with benign and malignant voice pathologies. All models are accessible in a public repository, providing a foundation for future research. We evaluate three algorithms and three audio feature sets, using both audio-only inputs and multimodal inputs that incorporate demographic and symptom data. Our best model achieves a balanced accuracy of 83.7%, sensitivity of 84.0%, specificity of 83.3%, and AUROC of 91.8%.
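The reported metrics are related: balanced accuracy is the mean of sensitivity and specificity. The following minimal sketch, assuming the standard binary-classification definitions with malignant as the positive class (an assumption, since the abstract does not state the convention), checks that the reported figures agree with each other to within rounding. The function names are illustrative and not taken from the authors' code.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of positive (assumed malignant) cases detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of negative (assumed benign) cases correctly rejected."""
    return tn / (tn + fp)


def balanced_accuracy(sens: float, spec: float) -> float:
    """Balanced accuracy is the arithmetic mean of sensitivity and specificity."""
    return (sens + spec) / 2


# Values reported in the abstract for the best model.
reported_sens = 0.840
reported_spec = 0.833
reported_ba = 0.837

ba = balanced_accuracy(reported_sens, reported_spec)
print(f"{ba:.4f}")  # 0.8365, consistent with the reported 83.7% after rounding
```

Because balanced accuracy weights both classes equally, it is less sensitive to class imbalance than plain accuracy, which is one reason it is a common headline metric for screening tasks.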
