How Suboptimal is Training rPPG Models with Videos and Targets from Different Body Sites?
Björn Braun, Daniel McDuff, Christian Holz
TL;DR
The study addresses how the choice of ground-truth PPG site (forehead vs. fingertip) affects supervised rPPG models trained on facial videos. It evaluates three architectures (DeepPhys, TS-CAN, PhysNet) on a dataset with synchronized forehead and finger PPG under leave-one-subject-out (LOSO) cross-validation, showing that forehead ground truth reduces the waveform mean squared error (MSE) by up to 40% relative to fingertip ground truth and yields better morphological fidelity. Heart-rate estimation remains reliable across site combinations, but waveform fidelity benefits most from site-consistent labeling. The findings highlight the importance of matching the ground-truth PPG location to the input video and of adopting waveform-level evaluation for more accurate downstream physiological assessments.
Abstract
Remote camera measurement of the blood volume pulse via photoplethysmography (rPPG) is a compelling technology for scalable, low-cost, and accessible assessment of cardiovascular information. Neural networks currently provide the state of the art for this task, and supervised training or fine-tuning is an important step in creating these models. However, most current models are trained on facial videos using contact PPG measurements from the fingertip as targets (labels). One reason for this is that few public datasets to date have incorporated contact PPG measurements from the face. Yet there is copious evidence that the PPG signals at different sites on the body have very different morphological features. Is training a facial video rPPG model using contact measurements from another site on the body suboptimal? Using a recently released, unique dataset with synchronized contact PPG and video measurements from both the hand and face, we can provide precise and quantitative answers to this question. With state-of-the-art neural models, we obtain up to 40% lower mean squared errors between the waveforms of the predicted and the ground-truth PPG signals when training on PPG signals from the forehead than when training on PPG signals from the fingertip. We also show qualitatively that the neural models learn to predict the morphology of the ground-truth PPG signal better when trained on the forehead PPG signals. However, while models trained on the forehead PPG produce a more faithful waveform, models trained on a finger PPG do still learn the dominant frequency (i.e., the heart rate) well.
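The distinction drawn above between waveform fidelity and dominant-frequency accuracy can be illustrated with a minimal sketch. It is not the evaluation code used in the study: the z-score normalisation, the 0.7 to 3.0 Hz heart-rate band, the 30 Hz sampling rate, and the synthetic signals are all assumptions chosen for illustration, and the function names `waveform_mse` and `dominant_hr_bpm` are hypothetical.

```python
import numpy as np

def zscore(x):
    # Normalise to zero mean and unit variance so amplitudes are comparable.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def waveform_mse(pred, target):
    # Mean squared error between two normalised waveforms (assumed metric form).
    return float(np.mean((zscore(pred) - zscore(target)) ** 2))

def dominant_hr_bpm(signal, fs, lo_hz=0.7, hi_hz=3.0):
    # Heart rate as the strongest spectral peak inside an assumed plausible band.
    x = zscore(signal)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return float(60.0 * freqs[band][np.argmax(power[band])])

# Synthetic example: two signals with the same pulse rate but different shape.
fs = 30.0                      # assumed video/PPG sampling rate in Hz
t = np.arange(600) / fs        # 20 seconds of samples
f_hr = 1.2                     # 72 beats per minute

# "Target" with a second harmonic, standing in for a richer pulse morphology.
target = np.sin(2 * np.pi * f_hr * t) + 0.4 * np.sin(2 * np.pi * 2 * f_hr * t + 0.5)
# "Prediction" that captures only the fundamental frequency.
other_shape = np.sin(2 * np.pi * f_hr * t)

mse_same = waveform_mse(target, target)        # identical morphology
mse_other = waveform_mse(other_shape, target)  # same rate, different morphology
hr_target = dominant_hr_bpm(target, fs)
hr_other = dominant_hr_bpm(other_shape, fs)

print(f"MSE (same shape):  {mse_same:.4f}")
print(f"MSE (other shape): {mse_other:.4f}")
print(f"HR target: {hr_target:.1f} bpm, HR other: {hr_other:.1f} bpm")
```

In this toy setting the two signals yield the same heart-rate estimate while their waveform MSE differs, which mirrors why a heart-rate metric alone may not reveal a mismatch in pulse morphology.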
