A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice

Mary Paterson, James Moor, Luisa Cutillo

TL;DR

This paper tackles the reproducibility gap in voice-based laryngeal cancer detection by introducing a public benchmark of 36 models trained on the FEMH and SVD datasets, with open code and standardized evaluation. It systematically compares three audio feature sets and three classifiers across four input configurations (audio alone, with demographics, with symptoms, and with both), emphasizing fairness and inference-time metrics. The strongest result comes from a logistic regression model using OpenSMILE features on the full input set (audio + demographics + symptoms), which achieves balanced accuracy $BA\approx0.837$, sensitivity $S\approx0.840$, specificity $SP\approx0.833$, and AUROC $\approx0.918$ on FEMH, with solid generalization to SVD; nonetheless, fairness concerns and dataset imbalances remain. The study demonstrates that simpler ML approaches with robust feature extraction can outperform some deep learning methods in this domain, provides a reproducible baseline for future work, and highlights practical considerations for clinical deployment, such as potential demographic biases and the need for broader external validation.
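The count of 36 models follows from the grid described above: three feature sets, three classifiers, and four input configurations. The sketch below only enumerates that grid; it is not the authors' code. Only `opensmile` and `logistic_regression` are named in this summary, so the other feature-set and classifier names are hypothetical placeholders, as is the helper `benchmark_grid`.

```python
from itertools import product

# Only "opensmile" and "logistic_regression" appear in the summary;
# the "*_b" and "*_c" entries are hypothetical placeholders.
FEATURE_SETS = ["opensmile", "feature_set_b", "feature_set_c"]
CLASSIFIERS = ["logistic_regression", "classifier_b", "classifier_c"]
INPUT_CONFIGS = [
    "audio",
    "audio+demographics",
    "audio+symptoms",
    "audio+demographics+symptoms",
]


def benchmark_grid():
    """Return one dict per (feature set, classifier, input configuration)."""
    return [
        {"features": f, "classifier": c, "inputs": i}
        for f, c, i in product(FEATURE_SETS, CLASSIFIERS, INPUT_CONFIGS)
    ]


grid = benchmark_grid()
print(len(grid))  # 3 * 3 * 4 = 36
```

The best-performing configuration reported above corresponds to the grid entry with `opensmile` features, `logistic_regression`, and the `audio+demographics+symptoms` inputs.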

Abstract

Cases of laryngeal cancer are predicted to rise significantly in the coming years. Current diagnostic pathways are inefficient, putting undue stress on both patients and the medical system. Artificial intelligence offers a promising solution by enabling non-invasive detection of laryngeal cancer from patient voice, which could help prioritise referrals more effectively. A major barrier in this field is the lack of reproducible methods. Our work addresses this challenge by introducing a benchmark suite comprising 36 models trained and evaluated on open-source datasets. These models classify patients with benign and malignant voice pathologies. All models are accessible in a public repository, providing a foundation for future research. We evaluate three algorithms and three audio feature sets, including both audio-only inputs and multimodal inputs incorporating demographic and symptom data. Our best model achieves a balanced accuracy of 83.7%, sensitivity of 84.0%, specificity of 83.3%, and AUROC of 91.8%.
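For a binary classifier, balanced accuracy is the mean of sensitivity and specificity, so the reported figures can be checked against each other. The snippet below is a small consistency check under that standard definition, not the paper's evaluation code; the function name `balanced_accuracy` is an illustrative choice.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Balanced accuracy for a binary task: mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Rounded values reported in the abstract.
ba = balanced_accuracy(sensitivity=0.840, specificity=0.833)
print(f"{ba:.4f}")  # 0.8365
```

The result, 0.8365, agrees with the reported 83.7% to within the rounding of the published sensitivity and specificity.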

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: The number of patients for each diagnosis split into benign and malignant.
  • Figure 2: The distribution of ages in the different datasets for the benign and malignant samples.
  • Figure 3: The percentages of male and female samples in the different datasets for the benign and malignant samples.
  • Figure 4: The classification process used in this work, where $\bar{x_1}$ is a vector of audio features and $\bar{x_2}$ is a vector of demographic/symptom data.
  • Figure 5: The balanced accuracy for the holdout (FEMH) and external (SVD) test sets. 95% confidence intervals are shown.
  • ...and 1 more figure