
Contrasting Deep Learning Models for Direct Respiratory Insufficiency Detection Versus Blood Oxygen Saturation Estimation

Marcelo Matheus Gauy, Natalia Hitomi Koza, Ricardo Mikio Morita, Gabriel Rocha Stanzione, Arnaldo Candido Junior, Larissa Cristina Berti, Anna Sara Shafferman Levin, Ester Cerdeira Sabino, Flaviane Romani Fernandes Svartman, Marcelo Finger

TL;DR

The paper investigates whether top-tier audio models can detect respiratory insufficiency (RI) from voice and whether they can estimate blood oxygen saturation (SpO$_2$) from the same audio. While Audio-MAE and PANNs achieve near-perfect RI detection, their ability to predict or classify SpO$_2$ levels from voice is markedly limited: regression shows low correlation, and classification fails to exceed an F1-score of about 0.65. This reveals a clear domain separation: voice biomarkers robustly indicate RI status but do not reliably map to SpO$_2$ values under current data and methods. The results have practical implications for audio-based triage: strong RI detection is feasible, but SpO$_2$ estimation from speech alone remains challenging and may require additional data modalities or larger datasets.

Abstract

We contrast the high effectiveness of state-of-the-art deep learning architectures, designed for general audio classification tasks and refined for respiratory insufficiency (RI) detection, with their performance on blood oxygen saturation (SpO$_2$) estimation and classification through automated audio analysis. Recently, multiple deep learning architectures have been proposed to detect RI in COVID patients through audio analysis, achieving accuracy above 95% and F1-score above 0.93. RI is a condition associated with low SpO$_2$ levels, commonly defined by the threshold SpO$_2$ < 92%. While SpO$_2$ serves as a crucial determinant of RI, a medical doctor's diagnosis typically relies on multiple factors. These include respiratory frequency, heart rate, and SpO$_2$ levels, among others. Here we study pretrained audio neural networks (CNN6, CNN10 and CNN14) and the Masked Autoencoder (Audio-MAE) for RI detection, where these models achieve near-perfect accuracy, surpassing previous results. Yet, for the regression task of estimating SpO$_2$ levels, the models achieve root mean square error values exceeding the accepted clinical range of 3.5% for finger oximeters. Additionally, Pearson correlation coefficients fail to surpass 0.3. As deep learning models perform better in classification than in regression, we transform SpO$_2$ regression into an SpO$_2$-threshold binary classification problem, with a threshold of 92%. However, this task still yields an F1-score below 0.65. Thus, audio analysis offers valuable insights into a patient's RI status, but does not provide accurate information about actual SpO$_2$ levels, indicating a separation of domains in which voice and speech biomarkers may and may not be useful in medical diagnostics under current technologies.
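The evaluation quantities named in the abstract (root mean square error, Pearson correlation, and the F1-score of the 92% threshold task) can be illustrated with a minimal sketch. The function names, the choice of SpO$_2$ < 92% as the positive class, the use of a single-class (not macro-averaged) F1, and the sample values are all assumptions made for illustration; none of them is taken from the paper's code or data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between reference and estimated values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def binarize(spo2_values, threshold=92.0):
    """Label 1 when SpO2 is below the threshold (assumed positive class), else 0."""
    return [1 if v < threshold else 0 for v in spo2_values]

def f1_score(labels, preds):
    """F1-score of the positive class (label 1)."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up oximeter readings and model estimates, for illustration only.
spo2_true = [97.0, 95.0, 90.0, 88.0, 93.0, 91.0]
spo2_pred = [95.0, 94.0, 93.0, 92.5, 94.5, 90.0]

error = rmse(spo2_true, spo2_pred)
correlation = pearson_r(spo2_true, spo2_pred)
f1 = f1_score(binarize(spo2_true), binarize(spo2_pred))
print(f"RMSE={error:.2f}  r={correlation:.2f}  F1={f1:.2f}")
```

Under this convention, an RMSE above 3.5 would fall outside the clinical range the abstract cites for finger oximeters, and the thresholded labels are what a binary classifier would be trained to predict.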

Paper Structure (13 sections, 3 figures, 3 tables)

This paper contains 13 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: SpO$_2$ distribution. The mean SpO$_2$ is $93.4$ for men and $94.0$ for women.
  • Figure 2: Architecture for the binary classification tasks. Original Architecture refers to either Audio-MAE or the PANNs (CNN6, CNN10, CNN14). FC Linear 2 units is a fully connected (FC) linear layer with $2$ units, to which we apply softmax as part of the BCEWithLogits loss. This architecture is used for both the SpO$_2$ classification and RI detection tasks.
  • Figure 3: Architecture for the SpO$_2$ regression task. Original Architecture refers to either Audio-MAE or the PANNs (CNN6, CNN10, CNN14). FC Linear $x$ units is a fully connected (FC) linear layer with $x$ units. We apply the Mish activation function to the intermediate layer. We varied the number of units in the intermediate layer among $10$, $25$, $50$, and $100$, and also tried adding dropout between the intermediate layer and the last layer. We also ran experiments using GELU in place of Mish.
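
The regression head described in Figure 3 can be sketched in plain Python to make the data flow concrete: a backbone embedding passes through an intermediate fully connected layer with Mish (or GELU) and then through a last linear layer. This is only a sketch under stated assumptions: the embedding size of 8, the single-unit output layer, the random weight initialization, and all function names are hypothetical, and dropout is omitted because it is inactive at inference time.

```python
import math
import random

def mish(x):
    """Mish activation: x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)

def gelu(x):
    """Exact GELU activation: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def make_linear(n_in, n_out, rng):
    """Create random weights and biases for a fully connected layer (illustrative init)."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-bound, bound) for _ in range(n_out)]
    return weights, biases

def linear(inputs, weights, biases):
    """Apply a fully connected layer: one dot product plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def regression_head(embedding, hidden_layer, output_layer, activation=mish):
    """Intermediate FC layer + activation, then a last FC layer with one output (assumed)."""
    hidden = [activation(h) for h in linear(embedding, *hidden_layer)]
    return linear(hidden, *output_layer)[0]

rng = random.Random(0)
embedding_dim = 8    # hypothetical backbone embedding size
hidden_units = 10    # one of the intermediate sizes listed in the caption
hidden_layer = make_linear(embedding_dim, hidden_units, rng)
output_layer = make_linear(hidden_units, 1, rng)

embedding = [rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
estimate_mish = regression_head(embedding, hidden_layer, output_layer)
estimate_gelu = regression_head(embedding, hidden_layer, output_layer, activation=gelu)
print(estimate_mish, estimate_gelu)
```

Swapping `activation=gelu` for the default reproduces the GELU variant mentioned in the caption, and changing `hidden_units` to 25, 50, or 100 covers the other intermediate sizes. With untrained random weights the outputs are not meaningful SpO$_2$ estimates.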