Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study

Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff

TL;DR

This benchmarking study evaluates how well deep learning (DL) models for cuffless blood pressure (BP) estimation from photoplethysmography (PPG) generalize, training on PulseDB and testing on several external datasets. It compares CNN and S4 architectures, examines the calibration-based and calibration-free PulseDB subsets, and introduces a simple sample-weighting domain adaptation to mitigate distribution shifts. Results show strong in-distribution (ID) performance but substantial out-of-distribution (OOD) gaps driven by differences in BP distributions; the Vital-based CalibFree and AAMI subsets often generalize better to external data, and importance weighting yields modest but meaningful improvements. The work highlights the need for robust OOD evaluation and for practical strategies that improve cross-dataset performance toward clinically viable cuffless BP estimation.

Abstract

Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of how well they generalize to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution mean absolute errors (MAEs) of 9.4 and 6.0 mmHg for systolic BP (SBP) and diastolic BP (DBP), respectively, on PulseDB with subject-specific calibration, and 14.0 and 8.5 mmHg, respectively, without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to make more researchers aware of the importance and challenges of out-of-distribution generalization.
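
The sample-based domain adaptation mentioned above can be illustrated with a minimal sketch of label-based importance weighting. The abstract does not specify the exact weighting scheme, so the following is an assumption: each training sample is weighted by the ratio of target to source label density, estimated from BP histograms. The function names, the 10 mmHg bin width, and the smoothing constant are hypothetical choices made for illustration only.

```python
from collections import Counter


def importance_weights(source_labels, target_labels, bin_width=10.0, eps=1e-6):
    """Weight each source sample by an estimate of p_target(y) / p_source(y).

    Densities are estimated with fixed-width histograms over the BP labels
    (an assumption of this sketch, not necessarily the paper's estimator).
    """
    def bin_of(y):
        return int(y // bin_width)

    src_counts = Counter(bin_of(y) for y in source_labels)
    tgt_counts = Counter(bin_of(y) for y in target_labels)
    n_src, n_tgt = len(source_labels), len(target_labels)

    weights = []
    for y in source_labels:
        b = bin_of(y)
        p_src = src_counts[b] / n_src
        p_tgt = tgt_counts.get(b, 0) / n_tgt
        # eps keeps the ratio finite and avoids exactly-zero weights
        weights.append((p_tgt + eps) / (p_src + eps))

    # Rescale so the weights average to 1, which keeps the loss scale comparable
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]


def weighted_mae(y_true, y_pred, weights):
    """Weighted mean absolute error, usable as a reweighted training objective."""
    total = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)


# Toy SBP labels in mmHg: the source is dominated by 100, the target by 140
source_sbp = [100, 100, 100, 140]
target_sbp = [140, 140, 140, 100]
w = importance_weights(source_sbp, target_sbp)
```

In this toy example the single 140 mmHg training sample receives roughly nine times the weight of each 100 mmHg sample, so the reweighted training loss emphasizes the BP range that dominates the target dataset.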

Paper Structure

This paper contains 15 sections, 3 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Schematic comparison of normalized label distributions (e.g., SBP or DBP) across training and test datasets.
  • Figure 2: MASE scores of different models trained on different PulseDB subsets and training scenarios: (top) SBP and (bottom) DBP.
  • Figure 3: Relationship between Dissimilarity Measure (EMD) and OOD Performance for SBP. The scatter plot shows the correlation between EMD and OOD MAE for SBP. Lower EMD reflects greater similarity to the baseline dataset (CalibFree Vital training set).
  • Figure 4: SBP distribution plots for all datasets.
  • Figure 5: DBP distribution plots for all datasets.
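
The dissimilarity measure in Figure 3, the earth mover's distance (EMD) between BP label distributions, can be sketched for the one-dimensional case as the area between two empirical cumulative distribution functions. This is a minimal illustration on raw label samples; the paper may compute the EMD on normalized or binned distributions instead, and the function names here are hypothetical.

```python
def emd_1d(a, b):
    """1D earth mover's distance between two samples.

    Computed as the integral of |CDF_a(x) - CDF_b(x)| over x, evaluated
    piecewise between consecutive distinct sample values.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    i = j = 0
    dist = 0.0
    for left, right in zip(points, points[1:]):
        # Advance counters so that i/len(a) and j/len(b) equal the
        # empirical CDFs of a and b evaluated at `left`
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        dist += abs(i / len(a) - j / len(b)) * (right - left)
    return dist


def mae(y_true, y_pred):
    """Mean absolute error in the units of the labels (mmHg for BP)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy SBP labels in mmHg: the second dataset is the first shifted by 10 mmHg
train_sbp = [100, 120]
external_sbp = [110, 130]
shift = emd_1d(train_sbp, external_sbp)
```

For two distributions that differ only by a constant offset, the EMD equals that offset, which is 10 mmHg in the toy example. Plotting such a value against the OOD MAE for each external dataset would produce a scatter plot of the kind described for Figure 3.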