Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement
Wei Qian, Qi Li, Kun Li, Xinke Wang, Xiao Sun, Meng Wang, Dan Guo
TL;DR
This work tackles self-supervised heart rate (HR) estimation from unlabeled facial videos with two complementary pipelines: (i) a non-end-to-end Spatial-Temporal Transformer that extracts rPPG cues from MSTmap representations and is trained with four PSD-based self-supervised losses (bandwidth, sparsity, variance, periodicity), and (ii) an end-to-end contrastive learning framework that uses ST-rPPG blocks and Contrast-Phys+-style objectives, followed by supervised fine-tuning. An ensemble of the two solutions exploits their complementary strengths, achieving an RMSE of 8.85277 and placing 2nd in Track 1 of the RePSS challenge at IJCAI 2024. The results indicate that unlabeled data, combined with carefully designed priors on the spectral properties of rPPG, can yield robust HR estimates across varied conditions; the ablation results highlight the value of the periodicity prior and of larger pretraining sets. The work also outlines practical implementation details, including MSTmap construction, the spatial-temporal Transformer configuration, and contrastive pretraining strategies, contributing to label-free remote HR estimation for real-world deployment.
Abstract
This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 (self-supervised heart rate measurement) of the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge hosted at IJCAI 2024. The goal is to develop a self-supervised learning algorithm for heart rate (HR) estimation using unlabeled facial videos. To tackle this task, we present two self-supervised HR estimation solutions, one based on spatial-temporal modeling and the other on contrastive learning. Specifically, we first propose a non-end-to-end self-supervised HR measurement framework based on spatial-temporal modeling, which captures subtle remote photoplethysmography (rPPG) cues and uses the inherent bandwidth and periodicity characteristics of rPPG to constrain the model. In addition, we employ an end-to-end solution based on contrastive learning, which aims to generalize across different scenarios from a complementary perspective. Finally, we combine the two solutions through an ensemble strategy to generate the final predictions, leading to more accurate HR estimation. Our solutions achieved an RMSE of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.
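To make the bandwidth constraint, the ensemble step, and the RMSE metric concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a heart rate band of 0.66 to 3.0 Hz, a bandwidth penalty defined as the fraction of spectral power outside that band, and an equally weighted average as the ensemble; none of these specifics are stated in the abstract. All function names are hypothetical, and the sparsity, variance, and periodicity losses are not shown.

```python
import math

# Assumed HR band in Hz (roughly 40 to 180 bpm); not taken from the paper.
HR_BAND = (0.66, 3.0)


def periodogram(signal, fs, freqs_hz):
    """Power of the mean-removed signal at each requested frequency (direct DFT)."""
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    power = []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / fs
        re = sum(x * math.cos(w * n) for n, x in enumerate(centered))
        im = sum(x * math.sin(w * n) for n, x in enumerate(centered))
        power.append(re * re + im * im)
    return power


def bandwidth_loss(power, freqs_hz, band=HR_BAND):
    """Fraction of spectral power that falls outside the assumed HR band."""
    total = sum(power)
    if total <= 0.0:
        return 0.0
    outside = sum(p for p, f in zip(power, freqs_hz) if f < band[0] or f > band[1])
    return outside / total


def estimate_hr_bpm(power, freqs_hz, band=HR_BAND):
    """HR estimate: frequency of the strongest in-band spectral peak, in bpm."""
    in_band = [(p, f) for p, f in zip(power, freqs_hz) if band[0] <= f <= band[1]]
    return 60.0 * max(in_band)[1]


def ensemble(preds_a, preds_b, weight=0.5):
    """Weighted average of two models' HR predictions (weight is an assumption)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(preds_a, preds_b)]


def rmse(pred, target):
    """Root mean squared error between predicted and reference HR values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
```

As a usage example, a clean 1.2 Hz sinusoid sampled at 30 fps for 10 s, evaluated on a 1 bpm frequency grid from 0 to 4 Hz, should give an estimate close to 72 bpm and a bandwidth loss near zero, while a 3.5 Hz sinusoid should be penalized heavily.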
