Evaluating the Impact of Pulse Oximetry Bias in Machine Learning under Counterfactual Thinking

Inês Martins, João Matos, Tiago Gonçalves, Leo A. Celi, A. Ian Wong, Jaime S. Cardoso

TL;DR

This work examines how medical device bias, specifically the discrepancy between pulse oximetry readings ($SpO_2$) and the gold-standard arterial blood-gas measurement ($SaO_2$), affects downstream ML predictions in healthcare. It introduces a counterfactual framework that holds all other factors constant and compares models trained on unbiased $SaO_2$ with models trained on biased $SpO_2$, using paired measurements from the BOLD dataset. Across three clinical tasks and two ML models, $SaO_2$-based models generally perform better; bias in $SpO_2$ lowers recall for adverse outcomes and produces more false negatives, particularly in at-risk subgroups. The approach offers a transparent, device-agnostic way to quantify, and potentially mitigate, downstream fairness and performance disparities in clinical ML applications.

Abstract

Algorithmic bias in healthcare mirrors existing data biases. However, the factors driving unfairness are not always known. Medical devices capture significant amounts of data but are prone to errors; for instance, pulse oximeters overestimate the arterial oxygen saturation of darker-skinned individuals, leading to worse outcomes. The impact of this bias on machine learning (ML) models remains unclear. This study addresses the technical challenges of quantifying the impact of medical device bias on downstream ML. Our experiments compare a "perfect world", without pulse oximetry bias, using SaO2 (blood-gas), to the "actual world", with biased measurements, using SpO2 (pulse oximetry). Under this counterfactual design, two models are trained with identical data, features, and settings, except for the method of measuring oxygen saturation: models using SaO2 are the "control" and models using SpO2 the "treatment". The blood-gas oximetry linked dataset (BOLD) was a suitable test-bed, containing 163,396 nearly simultaneous SpO2-SaO2 paired measurements, aligned with a wide array of clinical features and outcomes. We studied three classification tasks: in-hospital mortality, respiratory SOFA score in the next 24 hours, and SOFA score increase by two points. Models using SaO2 instead of SpO2 generally showed better performance. For patients whose pulse oximetry overestimated oxygen saturation by more than 3%, mortality prediction recall decreased significantly, from 0.63 to 0.59 (P < 0.001). This mirrors clinical processes in which biased pulse oximetry readings give clinicians false reassurance about patients' oxygen levels. A similar degradation occurred in ML models, with pulse oximetry bias leading to more false negatives in predicting adverse outcomes.
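The control/treatment design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the column names (`sao2`, `spo2`, `heart_rate`), the helper names, the toy records, and the threshold rule standing in for a trained model are all hypothetical, and evaluation is done in-sample only for brevity. The point it shows is that both arms share every feature and setting and differ only in the oxygen saturation column.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall for the positive (adverse outcome) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def counterfactual_recall(
    rows: Sequence[Row],
    labels: Sequence[int],
    shared_features: Sequence[str],
    fit_predict: Callable[[List[List[float]], Sequence[int]], List[int]],
) -> Dict[str, float]:
    """Run both arms with identical inputs except the oxygen saturation column."""
    results: Dict[str, float] = {}
    # "control" uses SaO2 (blood gas); "treatment" uses SpO2 (pulse oximetry).
    for arm, o2_column in (("control", "sao2"), ("treatment", "spo2")):
        X = [[row[name] for name in shared_features] + [row[o2_column]] for row in rows]
        results[arm] = recall(labels, fit_predict(X, labels))
    results["delta"] = results["treatment"] - results["control"]
    return results


def toy_fit_predict(X: List[List[float]], y: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for a trained model: it ignores y and flags an
    adverse outcome when the oxygen saturation (last column) is below 90."""
    return [1 if features[-1] < 90.0 else 0 for features in X]


# Invented paired records; the first one has SpO2 overestimating SaO2.
rows = [
    {"heart_rate": 80.0, "sao2": 86.0, "spo2": 91.0},
    {"heart_rate": 95.0, "sao2": 85.0, "spo2": 87.0},
    {"heart_rate": 70.0, "sao2": 97.0, "spo2": 98.0},
    {"heart_rate": 88.0, "sao2": 92.0, "spo2": 93.0},
]
labels = [1, 1, 0, 0]

results = counterfactual_recall(rows, labels, ["heart_rate"], toy_fit_predict)
```

On this toy data the overestimated reading hides one adverse outcome from the treatment arm, so its recall is lower than the control arm's. In a real experiment, `toy_fit_predict` would be replaced by an actual classifier trained and evaluated on separate splits.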

Paper Structure (13 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Assessment of the impact of medical device bias on downstream ML.
  • Figure 2: Mean value of the XGBoost performance metrics across race and ethnicity subgroups. Significant differences between SpO$_2$ and SaO$_2$ models are identified with: "$\ast$", for p-values $\leq$ 0.05; "$\ast\ast$", for p-values $\leq$ 0.01; or "$\ast\ast\ast$", for p-values $\leq$ 0.001. A: Asian; B: Black; HL: Hispanic or Latino; O: Other; W: White.
  • Figure 3: Mean value of the XGBoost performance metrics across disparity groups. Significant differences between SaO$_2$ and SpO$_2$ models are identified with: "$\ast$", for p-values $\leq$ 0.05; "$\ast\ast$", for p-values $\leq$ 0.01; or "$\ast\ast\ast$", for p-values $\leq$ 0.001.
  • Figure 4: Mean value of the XGBoost performance metrics for patients with consistent SaO$_2$ and SpO$_2$ values (both above or equal to 88%; class 0) and for patients with hidden hypoxemia (class 1). Significant differences between SaO$_2$ and SpO$_2$ models are identified with: "$\ast$", for p-values $\leq$ 0.05; "$\ast\ast$", for p-values $\leq$ 0.01; or "$\ast\ast\ast$", for p-values $\leq$ 0.001.
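
The grouping and significance notation used in these captions can be sketched in a few lines of Python. The star thresholds follow the captions directly. The class rule is an assumption: the captions define class 0 as both values at or above 88%, but they do not spell out hidden hypoxemia, so the sketch takes it to mean SaO$_2$ below 88% while SpO$_2$ is at or above 88%. The function names are hypothetical.

```python
from typing import Optional


def hidden_hypoxemia_class(sao2: float, spo2: float, threshold: float = 88.0) -> Optional[int]:
    """Assign the Figure 4 classes for one paired measurement.

    Class 0: SaO2 and SpO2 both at or above the threshold.
    Class 1: assumed definition of hidden hypoxemia, SaO2 below the
             threshold while SpO2 is at or above it.
    None:    pairs outside both classes (SpO2 below the threshold).
    """
    if sao2 >= threshold and spo2 >= threshold:
        return 0
    if sao2 < threshold and spo2 >= threshold:
        return 1
    return None


def significance_marker(p_value: float) -> str:
    """Map a p-value to the star notation used in the figure captions."""
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""
```

For example, a pair with SaO$_2$ of 85% and SpO$_2$ of 92% falls in class 1 under this assumed rule, and a p-value of 0.003 is marked with two stars.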