Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography

Vineet R. Shenoy, Shaoju Wu, Armand Comas, Tim K. Marks, Suhas Lohit, Hassan Mansour

TL;DR

This work tackles non-contact heart rate and pulse-rate variability estimation from facial video. It introduces TURNIP, a Time-Series U-Net with GRU-based recurrent skip connections, within a modular pipeline that also includes face-landmark detection and region-based time-series extraction. The approach achieves state-of-the-art results across RGB and NIR datasets, demonstrates robust handling of motion and self-occlusion, and provides extensive ablations that highlight the benefits of occlusion-awareness, the red-over-green color-channel strategy, and temporal recurrence. The findings hold potential for reliable, sensor-free vital-sign monitoring in telemedicine and safety-critical scenarios, with strong interpretability relative to end-to-end deep networks.

Abstract

Remote estimation of vital signs enables health monitoring for situations in which contact-based devices are either not available, too intrusive, or too expensive. In this paper, we present a modular, interpretable pipeline for pulse signal estimation from video of the face that achieves state-of-the-art results on publicly available datasets. Our imaging photoplethysmography (iPPG) system consists of three modules: face and landmark detection, time-series extraction, and pulse signal/pulse rate estimation. Unlike many deep learning methods that make use of a single black-box model that maps directly from input video to output signal or heart rate, our modular approach enables each of the three parts of the pipeline to be interpreted individually. The pulse signal estimation module, which we call TURNIP (Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography), allows the system to faithfully reconstruct the underlying pulse signal waveform and uses it to measure heart rate and pulse rate variability metrics, even in the presence of motion. When parts of the face are occluded due to extreme head poses, our system explicitly detects such "self-occluded" regions and maintains estimation robustness despite the missing information. Our algorithm provides reliable heart rate estimates without the need for specialized sensors or contact with the skin, outperforming previous iPPG methods on both color (RGB) and near-infrared (NIR) datasets.
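
As a rough illustration of the last stage of such a pipeline, the sketch below reads a pulse rate off a pulse waveform by locating the strongest spectral peak within a plausible heart-rate band. This is a generic spectral-peak estimator, not necessarily the procedure used in the paper; the function name `estimate_pulse_rate_bpm`, the 40 to 180 bpm band, the Hann window, and the zero-padding factor are all assumptions made for this example.

```python
import numpy as np


def estimate_pulse_rate_bpm(pulse, fps, lo_bpm=40.0, hi_bpm=180.0):
    """Return the dominant frequency of a pulse waveform in beats per minute.

    Hypothetical helper: the band limits and windowing are illustrative
    choices, not values taken from the paper.
    """
    pulse = np.asarray(pulse, dtype=float)
    pulse = pulse - pulse.mean()            # remove the DC component
    window = np.hanning(pulse.size)         # reduce spectral leakage
    n_fft = 8 * pulse.size                  # zero-pad for a finer frequency grid
    power = np.abs(np.fft.rfft(pulse * window, n=n_fft)) ** 2
    freqs_bpm = np.fft.rfftfreq(n_fft, d=1.0 / fps) * 60.0
    band = (freqs_bpm >= lo_bpm) & (freqs_bpm <= hi_bpm)
    return float(freqs_bpm[band][np.argmax(power[band])])


# Synthetic 10-second waveform at 1.2 Hz (72 beats per minute), sampled at 30 fps.
fps = 30.0
t = np.arange(0, 10, 1 / fps)
pulse = np.sin(2 * np.pi * 1.2 * t)
rate = estimate_pulse_rate_bpm(pulse, fps)
print(f"estimated pulse rate: {rate:.1f} bpm")
```

In a modular system of this kind, the waveform passed to such a function would be the denoised output of the pulse signal estimation module rather than a synthetic sinusoid.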

Paper Structure

This paper contains 32 sections, 6 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Our system for pulse signal estimation from video is composed of three modules, outlined in black: face and landmark detection, time series extraction, and pulse signal estimation. The pulse rate and pulse rate variability can then be estimated from the denoised pulse signal that is output from TURNIP.
  • Figure 2: Generation of landmarks and feature regions. We first detect 68 landmarks using the LUVLi landmark detector [luvli]. We then interpolate these landmarks across the cheeks and chin and extrapolate them up the forehead to generate 145 landmarks. We use these landmarks to define 48 regions. Finally, we aggregate the pixel intensities in each region using spatial averaging to obtain a 48-channel time series.
  • Figure 3: These example frames show the augmented set of 145 landmarks, with invisible landmarks shown in red. Landmarks are also labeled invisible if they are self-occluded. Finally, landmarks are labeled invisible if their locations were determined by interpolating or extrapolating from an invisible landmark. Previous algorithms such as SparsePPG and AutoSparsePPG [sparseppg, autosparseppg] do not detect when landmarks are invisible, so such landmarks can cause those methods to produce incorrect results in frames with extreme head rotations or translations. Because we explicitly detect invisible landmarks and label them as such, our algorithm can learn to be more robust to extreme poses.
  • Figure 4: The TURNIP Pulse Signal Estimation module. The signals from the 48 individual regions are extracted and input to the network as a $T\times48$ matrix. The spatio-temporal network denoises the signals based on the statistics of the data and outputs a clean signal. See Figure \ref{fig:three_iter} for example inputs and outputs of TURNIP.
  • Figure 5: Bland-Altman Analysis. Each point in the MMSE-HR and PURE graphs represents a non-overlapping 10-second window.
  • ...and 3 more figures
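
The spatial-averaging step described in the Figure 2 caption, which produces the $T\times48$ matrix mentioned in the Figure 4 caption, can be sketched as follows. This is a minimal sketch under stated assumptions: each region is represented here by a boolean pixel mask, the video is a single-channel intensity array, and the function name `region_time_series` is hypothetical. How the paper actually rasterizes the landmark-defined regions, and how it treats regions built from invisible landmarks, is not specified in these captions and is not modeled.

```python
import numpy as np


def region_time_series(frames, region_masks):
    """Average pixel intensities inside each region for every frame.

    frames:       array of shape (T, H, W), one intensity channel per frame.
    region_masks: boolean array of shape (R, H, W), one mask per region
                  (an assumed representation; R would be 48 in the paper).
    Returns an array of shape (T, R): one time series per region.
    """
    frames = np.asarray(frames, dtype=float)
    masks = np.asarray(region_masks, dtype=bool)
    counts = masks.reshape(masks.shape[0], -1).sum(axis=1)  # pixels per region
    if np.any(counts == 0):
        raise ValueError("every region mask must contain at least one pixel")
    # Sum the intensities inside each mask, then divide by the pixel count.
    sums = np.einsum("thw,rhw->tr", frames, masks.astype(float))
    return sums / counts


# Toy example: 4 frames of size 6x8, split into a left and a right region.
T, H, W = 4, 6, 8
left = np.zeros((H, W), dtype=bool)
left[:, : W // 2] = True
right = ~left
frames = np.empty((T, H, W))
for t in range(T):
    frames[t][left] = t          # left half has intensity t
    frames[t][right] = 10 + t    # right half has intensity 10 + t
series = region_time_series(frames, np.stack([left, right]))
print(series.shape)  # (T, R)
```

With 48 masks instead of two, the returned array has the $T\times48$ shape that the Figure 4 caption gives for the network input.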