Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement

Wei Qian; Qi Li; Kun Li; Xinke Wang; Xiao Sun; Meng Wang; Dan Guo

TL;DR

This work tackles self-supervised heart rate (HR) estimation from unlabeled facial videos with two complementary pipelines: (i) a non-end-to-end Spatial-Temporal Transformer that extracts rPPG cues from MSTmap representations and is trained with four PSD-based self-supervised losses (bandwidth, sparsity, variance, periodicity), and (ii) an end-to-end contrastive learning framework that uses ST-rPPG blocks and Contrast-Phys+-style objectives, followed by supervised fine-tuning. An ensemble of the two solutions achieves an RMSE of $8.85277$ on the test dataset, placing 2nd in Track 1 of the RePSS challenge at IJCAI 2024. The results indicate that priors on the spectral properties of rPPG allow unlabeled data to yield robust HR estimates across varied conditions; the ablations highlight the value of the periodicity prior and of larger pretraining sets. The work also documents implementation details, including MSTmap construction, the spatial-temporal Transformer configuration, and the contrastive pretraining strategy.
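To make the idea of a PSD-based prior concrete, here is a minimal sketch of one plausible reading of the bandwidth loss: the fraction of spectral power that falls outside a physiological heart-rate band. The band limits (0.66 to 3.0 Hz), the function names, and the grid periodogram are illustrative assumptions, not the authors' definitions; this summary does not give the exact loss formulas.

```python
import math

def power_spectrum(signal, fs, freqs):
    """Periodogram power of `signal` (sampled at `fs` Hz) at each frequency in `freqs` (Hz)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f / fs
        re = sum(x * math.cos(w * t) for t, x in enumerate(centered))
        im = sum(x * math.sin(w * t) for t, x in enumerate(centered))
        powers.append((re * re + im * im) / n)
    return powers

def bandwidth_loss(powers, freqs, low=0.66, high=3.0):
    """Fraction of power outside the assumed HR band [low, high] Hz (0 is best)."""
    total = sum(powers)
    if total == 0.0:
        return 0.0
    outside = sum(p for p, f in zip(powers, freqs) if f < low or f > high)
    return outside / total

def peak_hr_bpm(powers, freqs, low=0.66, high=3.0):
    """HR estimate in beats per minute from the strongest in-band frequency."""
    in_band = [(p, f) for p, f in zip(powers, freqs) if low <= f <= high]
    _, best_freq = max(in_band)
    return 60.0 * best_freq

# Synthetic 10 s pulse at 1.2 Hz (72 bpm), sampled at 30 frames per second.
fs = 30.0
signal = [math.sin(2.0 * math.pi * 1.2 * t / fs) for t in range(300)]
freqs = [k / 10.0 for k in range(1, 61)]  # 0.1 Hz grid up to 6 Hz
powers = power_spectrum(signal, fs, freqs)
print(peak_hr_bpm(powers, freqs), bandwidth_loss(powers, freqs))
```

A clean in-band pulse gives a loss near zero, while a signal dominated by out-of-band energy gives a loss near one, which is the behavior such a prior would penalize during training.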

Abstract

This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 (self-supervised heart rate measurement) of the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge hosted at IJCAI 2024. The goal is to develop a self-supervised learning algorithm for heart rate (HR) estimation using unlabeled facial videos. To tackle this task, we present two self-supervised HR estimation solutions, one based on spatial-temporal modeling and one based on contrastive learning. Specifically, we first propose a non-end-to-end self-supervised HR measurement framework based on spatial-temporal modeling, which captures subtle rPPG cues and uses the inherent bandwidth and periodicity characteristics of rPPG to constrain the model. We also adopt an end-to-end solution based on contrastive learning, which aims to generalize across different scenarios from a complementary perspective. Finally, we combine the strengths of the two solutions through an ensemble strategy to generate the final predictions, leading to more accurate HR estimation. As a result, our solutions achieved an RMSE of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.
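As a small illustration of the evaluation metric and of how two predictors might be combined, the sketch below computes RMSE and a weighted average of two sets of HR predictions. The weighted average is an assumed stand-in; the ensemble strategy the team actually used is not described in this summary, and the numbers are made up for the example.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between predicted and reference heart rates (bpm)."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

def ensemble_average(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two models' predictions (hypothetical ensemble rule)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]

# Toy values: the two models err in opposite directions, so averaging helps.
reference = [72.0, 78.0]
solution_1 = [70.0, 80.0]
solution_2 = [74.0, 76.0]
combined = ensemble_average(solution_1, solution_2)
print(rmse(solution_1, reference), rmse(combined, reference))
```

Averaging only helps when the errors of the two models are not strongly correlated, which is the usual motivation for combining pipelines that look at the data from different perspectives.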

Paper Structure (15 sections, 12 equations, 2 figures, 3 tables)

Introduction
Methodology
  Solution 1: Self-supervised HR Measurement with Spatial-Temporal Transformer
    Data Pre-processing
    Spatial-Temporal Transformer
    Self-supervised Loss
  Solution 2: Self-supervised HR Measurement with Contrastive Learning
    Data Pre-processing
    Pre-training
    Fine-tuning
Experiments
  Datasets
  Evaluation Metrics and Implementation Details
  Experimental Results
Conclusion

Figures (2)

Figure 1: Overview of the proposed Solution 1. Given an input facial video with $T$ frames, we obtain $N$ facial ROIs for each frame and extract the MSTmap representation $M \in \mathbb{R}^{T\times N\times C}$ for the video, where $N$ is the number of facial ROIs. A feature embedding layer projects the MSTmap to a high-dimensional feature $X \in \mathbb{R}^{T \times N \times D}$. We then stack $L$ spatial-temporal Transformer layers to capture subtle rPPG cues. Next, an rPPG regression head outputs the rPPG signal $s_{pre} \in \mathbb{R}^{T \times 1}$. Finally, we apply four self-supervised losses to constrain the model.
Figure 2: Overview of Solution 2. In the pre-training stage, the model is trained in a contrastive, self-supervised manner. The pre-trained model is then fine-tuned with a supervised loss.
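The contrastive pre-training stage in Figure 2 can be illustrated with a toy objective over power spectral densities (PSDs): spectra of rPPG samples from the same video are pulled together, and spectra from different videos are pushed apart. This is a minimal sketch of that general idea under the assumption of a mean-squared distance between PSDs; the function names and toy spectra are hypothetical, and the exact Contrast-Phys+-style objective used by the authors is not given in this summary.

```python
def mean_squared_distance(psd_a, psd_b):
    """Mean squared difference between two PSD vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(psd_a, psd_b)) / len(psd_a)

def contrastive_psd_loss(psds_video_a, psds_video_b):
    """Positive term (same-video distances) minus negative term (cross-video distances).

    Each argument is a list of at least two PSD vectors sampled from one video.
    """
    positive = []
    for group in (psds_video_a, psds_video_b):
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                positive.append(mean_squared_distance(group[i], group[j]))
    negative = [mean_squared_distance(p, q)
                for p in psds_video_a for q in psds_video_b]
    pos_term = sum(positive) / len(positive)
    neg_term = sum(negative) / len(negative)
    return pos_term - neg_term  # lower is better

# Toy PSDs: video A peaks in the first bin, video B in the last bin.
video_a = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
video_b = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
well_separated = contrastive_psd_loss(video_a, video_b)
mixed_up = contrastive_psd_loss([video_a[0], video_b[0]], [video_a[1], video_b[1]])
print(well_separated, mixed_up)
```

The loss is lower when samples from one video share a spectrum and differ from the other video, which matches the intuition that the pulse frequency is consistent within a clip but varies across clips.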
