Contrast-Phys+: Unsupervised and Weakly-supervised Video-based Remote Physiological Measurement via Spatiotemporal Contrast

Zhaodong Sun, Xiaobai Li

TL;DR

Contrast-Phys+ learns remote photoplethysmography (rPPG) from facial videos without requiring complete ground truth (GT) labels. It combines spatiotemporal rPPG signals, a contrastive loss computed on power spectral densities (PSDs), and GT information when available. The method exploits four observations about rPPG: spatial similarity, temporal similarity, cross-video dissimilarity, and a bounded heart rate (HR) range. These let it learn robust rPPG representations in unsupervised and weakly-supervised regimes. It demonstrates strong intra- and cross-dataset performance across RGB and near-infrared (NIR) data, robust handling of GT misalignment, and the ability to leverage unlabeled or synthetic data to improve generalization and fairness. With high efficiency and good heart rate variability (HRV) waveform quality, Contrast-Phys+ has practical implications for scalable, privacy-preserving, and multi-domain remote physiological measurement, and can extend to other periodic signals such as respiration.

Abstract

Video-based remote physiological measurement uses facial videos to measure the blood volume change signal, a technique also called remote photoplethysmography (rPPG). Supervised methods for rPPG measurement have been shown to achieve good performance. However, these methods require facial videos with ground truth (GT) physiological signals, which are often costly and difficult to obtain. In this paper, we propose Contrast-Phys+, a method that can be trained in both unsupervised and weakly-supervised settings. We employ a 3DCNN model to generate multiple spatiotemporal rPPG signals and incorporate prior knowledge of rPPG into a contrastive loss function. We further incorporate the GT signals into contrastive learning to adapt to partial or misaligned labels. The contrastive loss encourages rPPG/GT signals from the same video to be grouped together, while pushing those from different videos apart. We evaluate our method on five publicly available datasets that include both RGB and near-infrared (NIR) videos. Contrast-Phys+ outperforms the state-of-the-art supervised methods, even when using partially available or misaligned GT signals, or no labels at all. Additionally, we highlight the advantages of our method in terms of computational efficiency, noise robustness, and generalization. Our code is available at https://github.com/zhaodongsun/contrast-phys.
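
The grouping behaviour described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a toy version under stated assumptions: signals are compared through their normalized power spectra inside a heart rate band, same-video pairs are pulled together with a mean squared distance, and cross-video pairs are pushed apart with the negated distance. The helper names (`band_psd`, `mean_pairwise_mse`, `contrastive_loss`), the band limits, and the synthetic sinusoids standing in for rPPG signals are all hypothetical choices made for illustration.

```python
import numpy as np


def band_psd(signal, fs, lo_hz=0.66, hi_hz=4.16):
    """Normalized power spectrum of `signal` restricted to [lo_hz, hi_hz].

    The band limits are illustrative defaults (an assumption), chosen to
    cover plausible heart rates.
    """
    signal = signal - signal.mean()
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = power[(freqs >= lo_hz) & (freqs <= hi_hz)]
    return band / (band.sum() + 1e-12)


def mean_pairwise_mse(a, b, skip_diagonal=False):
    """Mean squared distance over all pairs (a[i], b[j]) of PSD rows."""
    diffs = a[:, None, :] - b[None, :, :]
    mse = (diffs ** 2).mean(axis=-1)  # shape (len(a), len(b))
    if skip_diagonal:
        # Drop the i == j pairs, which compare a signal with itself.
        return mse[~np.eye(len(a), dtype=bool)].mean()
    return mse.mean()


def contrastive_loss(psd_x, psd_y):
    """Pull PSDs within each group together, push the two groups apart."""
    positive = 0.5 * (
        mean_pairwise_mse(psd_x, psd_x, skip_diagonal=True)
        + mean_pairwise_mse(psd_y, psd_y, skip_diagonal=True)
    )
    negative = -mean_pairwise_mse(psd_x, psd_y)
    return positive + negative


def synthetic_signals(rate_hz, count, fs, seconds, rng):
    """Noisy sinusoids with random phase, standing in for rPPG signals."""
    t = np.arange(int(fs * seconds)) / fs
    phases = rng.uniform(0.0, 2.0 * np.pi, size=count)
    clean = np.sin(2.0 * np.pi * rate_hz * t[None, :] + phases[:, None])
    return clean + 0.1 * rng.standard_normal(clean.shape)


rng = np.random.default_rng(0)
fs = 30.0  # assumed video frame rate in Hz

# Four signals per "video": video X beats at 1.2 Hz, video Y at 1.8 Hz.
video_x = synthetic_signals(1.2, 4, fs, 10.0, rng)
video_y = synthetic_signals(1.8, 4, fs, 10.0, rng)
psd_x = np.stack([band_psd(s, fs) for s in video_x])
psd_y = np.stack([band_psd(s, fs) for s in video_y])

# Correct grouping: each group holds signals from one video only.
loss_separated = contrastive_loss(psd_x, psd_y)

# Wrong grouping: each group mixes signals from both videos.
mixed_a = np.concatenate([psd_x[:2], psd_y[:2]])
mixed_b = np.concatenate([psd_x[2:], psd_y[2:]])
loss_mixed = contrastive_loss(mixed_a, mixed_b)

print(f"separated: {loss_separated:.4f}  mixed: {loss_mixed:.4f}")
```

In this toy setup the loss is lower when each group contains signals from a single video than when the groups are mixed, which is the behaviour the contrastive objective rewards. Comparing spectra rather than raw waveforms makes the comparison insensitive to phase offsets between signals.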

Paper Structure (38 sections, 17 equations, 13 figures, 10 tables, 1 algorithm)

Figures (13)

  • Figure 1: Illustration of rPPG spatial similarity. The rPPG signals from four facial areas (A, B, C, D) have similar waveforms and power spectral densities (PSDs).
  • Figure 2: Illustration of rPPG temporal similarity. The rPPG signals from two temporal windows (A, B) have similar PSDs.
  • Figure 3: The most similar (left) and most different (right) cross-video PSD pairs in the OBF dataset.
  • Figure 4: The diagram of Contrast-Phys+ for weakly-supervised or unsupervised learning.
  • Figure 5: Spatial and temporal sampler for an ST-rPPG block.
  • ...and 8 more figures