PhySU-Net: Long Temporal Context Transformer for rPPG with Self-Supervised Pre-training

Marko Savic, Guoying Zhao

TL;DR

PhySU-Net targets robust, non-contact rPPG extraction from facial videos by exploiting a long temporal context with a transformer-based architecture operating on MSTmaps, paired with a self-supervised pre-training strategy. The model reconstructs BVPmaps from MSTmaps and includes an HR regression head. It is optimized with a multitask loss $L = \alpha L_{reg} + \beta L_{temp} + \gamma L_{freq}$, where $L_{reg}$ is the HR regression term, the temporal term $L_{temp}$ is based on MCC, and the frequency term $L_{freq}$ is based on the power spectral density (PSD). For self-supervision, 75% of the MSTmap patches are masked and a CHROM-based pseudo-label guides learning from unlabeled data. The authors report state-of-the-art supervised performance on OBF and VIPL-HR and show that self-supervised pre-training yields transferable representations, with ablations highlighting the benefits of longer temporal context and of the pre-training strategy. The work advances rPPG estimation under challenging conditions and enables learning from unlabeled data, with potential for adoption in clinical and real-world monitoring scenarios.
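
The masking step of the pretext task can be illustrated with a minimal NumPy sketch. Only the 75% ratio comes from the summary above; the patch size of 16, the zero fill value, the 64×64 toy map, and the name `mask_patches` are assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np

def mask_patches(image, patch=16, ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    image: (H, W, C) array with H and W divisible by `patch` (assumed).
    Returns the masked copy and a boolean (H // patch, W // patch) grid
    in which True marks a masked patch.
    """
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    n_total = gh * gw
    n_masked = int(round(ratio * n_total))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    chosen = rng.permutation(n_total)[:n_masked]
    grid = np.zeros(n_total, dtype=bool)
    grid[chosen] = True
    grid = grid.reshape(gh, gw)

    # Expand the patch grid to pixel resolution and blank those pixels.
    pixel_mask = grid.repeat(patch, axis=0).repeat(patch, axis=1)
    masked = image.copy()
    masked[pixel_mask] = 0.0
    return masked, grid

# Toy stand-in for a stacked MSTmap: 4 x 4 = 16 patches, 12 of them masked.
toy_map = np.ones((64, 64, 3))
masked_map, patch_grid = mask_patches(toy_map)
```

In a pretext setup of this kind, `masked_map` would be the encoder input and the unmasked `toy_map` the reconstruction target.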

Abstract

Remote photoplethysmography (rPPG) is a promising technology for contactless measurement of cardiac activity from facial videos. Most recent approaches utilize convolutional networks with limited temporal modeling capability or ignore long temporal context. Supervised rPPG methods are also severely limited by scarce data availability. In this work, we propose PhySU-Net, the first long spatial-temporal map rPPG transformer network, and a self-supervised pre-training strategy that exploits unlabeled data to improve our model. Our strategy leverages traditional methods and image masking to provide pseudo-labels for self-supervised pre-training. Our model is tested on two public datasets (OBF and VIPL-HR) and shows superior performance in supervised training. Furthermore, we demonstrate that our self-supervised pre-training strategy further improves our model's performance by leveraging representations learned from unlabeled data.

Paper Structure (10 sections, 4 equations, 2 figures, 4 tables)

This paper contains 10 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of our method: The input video is processed into a stacked MSTmap. For the supervised downstream task, the decoder reconstructs an image with temporal and frequency properties similar to those of the BVPmap label, and the HR head regresses the HR value with the HR ground truth as the label. For the self-supervised pretext task, only the input and the labels change: the input is a masked version of the MSTmap, which the decoder attempts to reconstruct into the full MSTmap, and a CHROM pseudo-label is used for the HR regression.
  • Figure 2: Preprocessing: a1) $C \times R$ temporal sequences are extracted by averaging pixels for each channel and ROI combination. a2) The ground-truth BVP is duplicated to match the dimensions of the MSTmap. b) The sequences are filtered with a pass band of $[0.7, 3]$ Hz and min-max normalized. c) The MSTmap and BVPmap are temporally stacked to form square images.
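
Steps a1, a2, and b of the Figure 2 caption can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' pipeline: the FFT-based band-pass filter, the equal-sized ROIs, the row ordering, and the helper names `bandpass_fft`, `min_max`, `build_mstmap`, and `build_bvpmap` are all hypothetical. The stacking step c is left out because the caption does not specify its layout.

```python
import numpy as np

def bandpass_fft(x, fs, low=0.7, high=3.0):
    """Zero all FFT bins outside [low, high] Hz along the last axis.
    (Assumed filter type; the caption only gives the pass band.)"""
    spec = np.fft.rfft(x, axis=-1)
    freqs = np.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    spec[..., (freqs < low) | (freqs > high)] = 0
    return np.fft.irfft(spec, n=x.shape[-1], axis=-1)

def min_max(x, eps=1e-8):
    """Scale each sequence (last axis) to the range [0, 1]."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    return (x - lo) / (hi - lo + eps)

def build_mstmap(roi_pixels, fs):
    """roi_pixels: (T, R, P, C) array of P pixels per frame, ROI, and channel.
    Returns an (R * C, T) map with one filtered, normalized row per
    ROI/channel combination."""
    t, r, _, c = roi_pixels.shape
    means = roi_pixels.mean(axis=2)                    # a1) average pixels
    seqs = means.transpose(1, 2, 0).reshape(r * c, t)  # one row per pair
    return min_max(bandpass_fft(seqs, fs))             # b) filter, normalize

def build_bvpmap(bvp, rows, fs):
    """a2) Duplicate the ground-truth BVP so it matches the MSTmap rows."""
    return np.tile(min_max(bandpass_fft(bvp, fs)), (rows, 1))

# Synthetic example: a 1.5 Hz pulse (90 bpm) seen at 30 fps in 4 ROIs.
fs, T, R, P, C = 30.0, 256, 4, 10, 3
time = np.arange(T) / fs
pulse = np.sin(2 * np.pi * 1.5 * time)
noise = np.random.default_rng(0).normal(0.0, 0.1, (T, R, P, C))
pixels = 100.0 + pulse[:, None, None, None] + noise

mst = build_mstmap(pixels, fs)
bvpmap = build_bvpmap(pulse, rows=mst.shape[0], fs=fs)
```

Both arrays end up with shape `(R * C, T)`, so each MSTmap row has a BVPmap row of the same length to be compared against.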