
CalibrationPhys: Self-supervised Video-based Heart and Respiratory Rate Measurements by Calibrating Between Multiple Cameras

Yusuke Akamatsu, Terumi Umematsu, Hitoshi Imaoka

TL;DR

CalibrationPhys measures heart rate (HR) and respiratory rate (RR) from facial video without ground-truth labels by applying self-supervised contrastive learning to synchronized videos from two cameras. It trains a camera-specific 2DCNN on spatio-temporal representations (RGB signals for HR, optical-flow signals for RR) and uses temporal augmentation and optional pre-training to improve robustness and generalization. The method achieves state-of-the-art or competitive performance on smartphone and webcam data, transfers across datasets, and reduces the need for labeled training data. Because only multi-camera videos are required to fit a model to a new camera, the approach supports deployment on arbitrary cameras, which matters for remote health monitoring and telemedicine where labeled data are scarce.

Abstract

Video-based heart and respiratory rate measurements using facial videos are more useful and user-friendly than traditional contact-based sensors. However, most of the current deep learning approaches require ground-truth pulse and respiratory waves for model training, which are expensive to collect. In this paper, we propose CalibrationPhys, a self-supervised video-based heart and respiratory rate measurement method that calibrates between multiple cameras. CalibrationPhys trains deep learning models without supervised labels by using facial videos captured simultaneously by multiple cameras. Contrastive learning is performed so that the pulse and respiratory waves predicted from the synchronized videos using multiple cameras are positive and those from different videos are negative. CalibrationPhys also improves the robustness of the models by means of a data augmentation technique and successfully leverages a pre-trained model for a particular camera. Experimental results utilizing two datasets demonstrate that CalibrationPhys outperforms state-of-the-art heart and respiratory rate measurement methods. Since we optimize camera-specific models using only videos from multiple cameras, our approach makes it easy to use arbitrary cameras for heart and respiratory rate measurements.
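
The contrastive objective described in the abstract can be illustrated with a small sketch. It is not the paper's loss: the function names, the use of Pearson correlation as the similarity measure, and the "positive distance minus negative distance" form are all assumptions chosen for brevity. Waves predicted by cameras A and B for the same video are treated as positives, and waves predicted for different videos as negatives.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0
    return cov / (std_x * std_y)


def calibration_contrastive_loss(waves_a, waves_b):
    """Hypothetical loss: waves_a[i] and waves_b[i] come from the same synchronized video.

    Positive pairs (same index) are pulled together, negative pairs
    (different indices) are pushed apart. Distance is 1 - correlation.
    """
    n = len(waves_a)
    positives = [1.0 - pearson(waves_a[i], waves_b[i]) for i in range(n)]
    negatives = [
        1.0 - pearson(waves_a[i], waves_b[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    positive_term = sum(positives) / len(positives)
    negative_term = sum(negatives) / len(negatives) if negatives else 0.0
    return positive_term - negative_term


def sine_wave(freq_hz, fps=30, seconds=10, gain=1.0, offset=0.0):
    """Synthetic stand-in for a predicted pulse wave (assumed values, not real data)."""
    return [
        gain * math.sin(2.0 * math.pi * freq_hz * k / fps) + offset
        for k in range(fps * seconds)
    ]


# Two videos with different pulse frequencies, each seen by cameras A and B.
camera_a = [sine_wave(1.0), sine_wave(1.5)]
camera_b = [sine_wave(1.0, gain=0.8, offset=0.1), sine_wave(1.5, gain=0.8, offset=0.1)]

aligned_loss = calibration_contrastive_loss(camera_a, camera_b)
shuffled_loss = calibration_contrastive_loss(camera_a, camera_b[::-1])
```

In this toy setup the loss is lower when the camera B predictions are paired with the correct camera A predictions than when the pairing is shuffled, which is the behavior a contrastive objective of this kind is meant to reward. A correlation in the time domain is sensitive to phase offsets between cameras, so a practical variant might compare frequency-domain features instead.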

Paper Structure (22 sections, 2 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: (i) CalibrationPhys performs contrastive learning using facial videos captured simultaneously by two cameras A and B (a webcam and a smartphone in this case) to train HR and RR estimation models for each camera. (ii) CalibrationPhys w/ Pre-train uses the pre-trained HR or RR estimation model for a particular camera (the webcam in this case). In contrastive learning, we fix the pre-trained model and train only the model for the newly applied camera (the smartphone in this case). (iii) During inference, we predict pulse and respiratory waves from the facial videos by using the trained model for each camera. In our experiments, we used a Logitech C920n HD PRO webcam (Camera A), a Google Pixel 5 smartphone (Camera B), the wearable sensor E4 wristband (PPG sensor), and the RIP respiratory sensor (Respiratory belt). Photos by Generated.Photos.
  • Figure 2: Training process of CalibrationPhys. Synchronized videos from multiple cameras are temporally augmented and transformed into spatio-temporal representations, and the pulse or respiratory waves are predicted by a 2DCNN. Contrastive learning then attracts the HR or RR estimates from facial videos captured simultaneously by cameras A and B and repels the estimates from different facial videos. Note that the models for HR and RR estimation are trained separately.
  • Figure 3: Extraction of RGB and optical flow signals from a facial video and construction of spatio-temporal representations.
  • Figure 4: Our model network for HR and RR measurements, which is a 2DCNN based on PhysNet (Yu et al., 2019). The input is a tensor of shape $C \times M \times T$, where $C$, $M$, and $T$ are the numbers of channels, ROIs, and frames, respectively, and the output shape is $1 \times 1 \times T$. "$5\times1$ Conv, 32" represents a convolution layer with a kernel size of $5\times1$ and 32 channels.
  • Figure 5: Mean absolute error (MAE) of HR (left) and RR (right) measurements between webcam and smartphone in the multi-camera dataset. For HR, CHROM (de Haan and Jeanne, 2013) is compared with CalibrationPhys. For RR, Optical Flow is compared with CalibrationPhys.
  • ...and 2 more figures
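
The $C \times M \times T$ input described in the captions for Figures 3 and 4 can be sketched as follows. This is an illustrative assumption rather than the paper's pipeline: the function name `spatio_temporal_map` is hypothetical, and the ROIs are taken to be cells of a regular grid over an already-cropped face, whereas the paper's ROI definition is not given in this section. Each ROI is averaged per channel and per frame, yielding one time series per (channel, ROI) pair.

```python
def spatio_temporal_map(frames, grid_rows, grid_cols):
    """Build a C x M x T nested list from T frames of H x W (r, g, b) pixels.

    C = 3 color channels, M = grid_rows * grid_cols ROIs (assumed grid cells),
    T = number of frames. Requires H >= grid_rows and W >= grid_cols.
    """
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    num_channels = 3
    num_rois = grid_rows * grid_cols
    st_map = [
        [[0.0] * num_frames for _ in range(num_rois)] for _ in range(num_channels)
    ]
    for t, frame in enumerate(frames):
        for row_idx in range(grid_rows):
            y0 = row_idx * height // grid_rows
            y1 = (row_idx + 1) * height // grid_rows
            for col_idx in range(grid_cols):
                x0 = col_idx * width // grid_cols
                x1 = (col_idx + 1) * width // grid_cols
                roi = row_idx * grid_cols + col_idx
                pixel_count = (y1 - y0) * (x1 - x0)
                for ch in range(num_channels):
                    total = sum(
                        frame[y][x][ch] for y in range(y0, y1) for x in range(x0, x1)
                    )
                    # Spatial mean of this channel inside the ROI at frame t.
                    st_map[ch][roi][t] = total / pixel_count
    return st_map


# Toy clip: 4 frames of 4 x 4 pixels whose color depends only on the frame index.
toy_frames = [
    [[(t, 2 * t, 3 * t) for _ in range(4)] for _ in range(4)] for t in range(4)
]
toy_map = spatio_temporal_map(toy_frames, grid_rows=2, grid_cols=2)
shape = (len(toy_map), len(toy_map[0]), len(toy_map[0][0]))
```

Here `shape` is `(3, 4, 4)`, matching $C = 3$, $M = 4$, and $T = 4$. For the RR model, the same layout would apply with optical-flow components in place of the RGB channels.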