A systematic evaluation of uncertainty quantification techniques in deep learning: a case study in photoplethysmography signal analysis

Ciaran Bench, Oskar Pfeffer, Vivek Desai, Mohammad Moulaeifard, Loïc Coquelin, Peter H. Charlton, Nils Strodthoff, Nando Hegemann, Philip J. Aston, Andrew Thompson

TL;DR

This study systematically evaluates eight uncertainty quantification (UQ) techniques applied to deep learning models analyzing photoplethysmography (PPG) signals for two clinically relevant tasks: atrial fibrillation (AF) detection and cuffless blood pressure (BP) estimation. Using a comprehensive evaluation framework that emphasizes local/adaptive reliability in addition to global calibration, the authors show that the best UQ method is highly task- and metric-dependent, with post-hoc calibration excelling in global reliability but often underperforming in adaptive, per-class settings. The findings underscore the importance of aligning UQ strategies with practical clinical use, especially in settings with limited per-patient measurements where small-scale reliability is critical. The work offers practical guidance on choosing UQ methods, highlights the value of local reliability analyses, and calls for UQ techniques that balance adaptivity with predictive performance to improve trustworthiness in wearable-based health monitoring.

Abstract

In principle, deep learning models trained on medical time-series, including wearable photoplethysmography (PPG) sensor data, can provide a means to continuously monitor physiological parameters outside of clinical settings. However, such models are at considerable risk of performing poorly when deployed in practical measurement scenarios, which can lead to negative patient outcomes. Reliable uncertainties accompanying predictions can guide clinicians in judging the trustworthiness of model outputs. It is therefore of interest to compare the effectiveness of different approaches to uncertainty quantification. Here we apply an unprecedented set of eight uncertainty quantification (UQ) techniques to models trained on two clinically relevant prediction tasks: atrial fibrillation (AF) detection (classification) and two variants of blood pressure regression. We formulate a comprehensive evaluation procedure to enable a rigorous comparison of these approaches. We observe a complex picture of uncertainty reliability across the different techniques, where the optimal technique for a given task depends on the chosen expression of uncertainty, evaluation metric, and scale of reliability assessed. We find that assessing local calibration and adaptivity provides practically relevant insights about model behaviour that cannot be acquired using more commonly implemented global reliability metrics. We emphasise that criteria for evaluating UQ techniques should cater to the model's practical use case, where the use of a small number of measurements per patient places a premium on achieving small-scale reliability for the chosen expression of uncertainty, while preserving as much predictive performance as possible.
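
The contrast the abstract draws between global reliability metrics and local calibration can be illustrated with a small example. The sketch below computes the standard binned expected calibration error (ECE) on a toy dataset, first pooled and then per group. The function name, bin count, and data are illustrative assumptions; this is not the paper's implementation, and its exact metric definitions may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence|
    over equal-width confidence bins on [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical): every prediction has confidence 0.9.
# Group A is always right (underconfident); group B is right 8 times in 10
# (overconfident). Pooled accuracy is 18/20 = 0.9, matching the confidence.
conf_a, correct_a = [0.9] * 10, [True] * 10
conf_b, correct_b = [0.9] * 10, [True] * 8 + [False] * 2

ece_global = expected_calibration_error(conf_a + conf_b, correct_a + correct_b)
ece_a = expected_calibration_error(conf_a, correct_a)
ece_b = expected_calibration_error(conf_b, correct_b)

print(f"global ECE:  {ece_global:.3f}")
print(f"group A ECE: {ece_a:.3f}")
print(f"group B ECE: {ece_b:.3f}")
```

In this toy case the pooled ECE is approximately zero while each group is miscalibrated by about 0.1, because the two errors cancel when pooled. This is one way a model can look reliable globally yet be unreliable on the smaller scales that matter when only a few measurements per patient are available.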

Paper Structure

This paper contains 68 sections, 27 equations, 6 figures, 18 tables, 3 algorithms.

Figures (6)

  • Figure 1: Uncertainty quantification in deep learning analyses of photoplethysmography (PPG) signals: (a) PPG signals can be measured by many clinical and consumer devices; (b) PPG signals capture the pulsation of blood with each heartbeat; (c) deep learning is commonly used to analyse PPG signals; (d) this study provides a systematic evaluation of uncertainty quantification techniques for deep learning; (e) aiming to improve the trustworthiness of analyses.
  • Figure 2: Reliability diagrams for ECE (top), VCE (middle), and UCE (bottom) calibration for 5 chosen UQ methods for both models. For alexnet, we include the results of Isotonic Regression (IR) on the MCD predictions, and for resnet, we include the results of IR on the DE predictions. The black dashed line represents the ideal calibration relationship.
  • Figure 3: Adaptive variation calibration plots as assessed by the VCE for DE, MCD, Venn-ABERS, and IR for both models. Venn-ABERS results are given for both models, whilst MCD+IR results are shown for alexnet, and DE+IR results are shown for resnet.
  • Figure 4: Adaptive reliability plots for DE, MCD, Venn-ABERS, and IR for both models. Venn-ABERS results are given for both models, whilst MCD+IR results are shown for alexnet, and DE+IR results are shown for resnet.
  • Figure 5: ENCE reliability diagrams of the 4 main UQ methods (DE, MCD, MAP, and QR) prior to recalibration for alexnet (top) and resnet (bottom) for the calibfree dataset. The quantile regression results are shown for the $2\sigma$ confidence level.
  • ...and 1 more figure