Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement

Zijie Yue, Miaojing Shi, Hanli Wang, Shuai Ding, Qijun Chen, Shanlin Yang

TL;DR

This work addresses non-contact physiological measurement from facial videos by learning to extract rhythm-based cues without ground-truth PPG signals. It proposes VL-phys, a frequency-centric self-supervised framework that bootstraps a pre-trained vision-language model by generating frequency-aware vision-text pairs via STMaps and text prompts, and optimizes with a combination of generative and contrastive losses. The method achieves state-of-the-art performance among self-supervised approaches and rivals supervised methods across four benchmarks, with strong cross-dataset generalization and robustness to makeup. The findings highlight the value of text-guided frequency reasoning and STMap-based representations, suggesting broader applicability to other frequency-driven vision-language tasks.

Abstract

Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches mostly rely on supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To address this, self-supervised learning has recently attracted attention; its performance is, however, limited by the lack of ground-truth PPG signals. In this paper, we propose a novel self-supervised framework that successfully integrates popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these vision-text pairs and then estimate rPPG signals. We develop a series of generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual map reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method is the first to adapt VLMs to digest and align the frequency-related knowledge in the vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state-of-the-art self-supervised methods.
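
The idea of negative samples whose signal frequency differs from the original by a known ratio, paired with text describing that ratio, can be illustrated on a one-dimensional toy signal. The sketch below is not the authors' implementation: it replaces the video and the learnable frequency augmentation with a synthetic sinusoid and plain linear-interpolation resampling, and the function names (`dominant_frequency_hz`, `resample_in_time`, `ratio_prompt`), the 0.5 to 4 Hz search band, and the prompt wording are all assumptions made for illustration.

```python
import numpy as np

def dominant_frequency_hz(signal, fps, band=(0.5, 4.0)):
    """Frequency (Hz) of the strongest spectral peak inside an assumed pulse band."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                      # drop the DC component
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return float(freqs[in_band][np.argmax(power[in_band])])

def resample_in_time(signal, ratio):
    """Read the signal at steps of `ratio` samples (linear interpolation).

    Played back at the original frame rate, the result oscillates `ratio`
    times as fast as the input. This is a stand-in for frequency
    augmentation, not the paper's learnable module.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int((signal.size - 1) / ratio) + 1
    positions = np.arange(n_out) * ratio
    return np.interp(positions, np.arange(signal.size), signal)

def ratio_prompt(ratio):
    # Hypothetical prompt wording; the paper's actual templates are not given here.
    return f"the signal frequency of the second sample is {ratio:g} times that of the first sample"

fps = 30.0
t = np.arange(600) / fps
anchor = np.sin(2 * np.pi * 1.2 * t)                     # toy pulse at 1.2 Hz (72 beats per minute)

ratios = [0.75, 1.5]
pairs = [
    (r, dominant_frequency_hz(resample_in_time(anchor, r), fps), ratio_prompt(r))
    for r in ratios
]
for r, freq, prompt in pairs:
    print(f"ratio {r:g}: dominant frequency {freq:.2f} Hz | {prompt}")
```

Because the ratio is chosen by the augmentation rather than measured, it provides a frequency relation between samples that is known without any ground-truth PPG recording, which is what makes it usable as a self-supervised signal.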

Paper Structure (39 sections, 5 equations, 10 figures, 13 tables)

This paper contains 39 sections, 5 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Existing vision-language models (VLMs) are pre-trained on diverse vision-text pairs, including those that describe the temporal variations of certain object attributes. For the first time, we equip VLMs with the ability to digest frequency-related knowledge of skin color temporal variation in the vision and text modalities for self-supervised remote physiological measurement.
  • Figure 2: Overall architecture of our VL-phys. Given an input video, we respectively apply spatial augmentation and learnable frequency augmentation (LFA) to obtain its positive and negative video samples. We generate their spatio-temporal maps (STMaps) and then create contrastive spatio-temporal maps (C-STMaps) to reflect the frequency ratios of skin color temporal variations between positive and negative samples; meanwhile, we carefully craft text prompts to describe such relations. Afterwards, we fine-tune the pre-trained vision and text encoders of VLM with these formed vision-text pairs via frequency-related multimodal generative and contrastive tasks, i.e. the text-guided visual reconstruction task and vision-text contrastive learning task. Moreover, we introduce the unimodal frequency contrastive loss and the frequency ranking loss to optimize the rPPG signals estimated from different video samples.
  • Figure 3: The generation process of the frequency-oriented vision-text pair and the masked contrastive spatio-temporal map (M-STMap).
  • Figure 4: The structure of the text-guided visual reconstruction (TVR) module.
  • Figure 5: Six examples for the visual comparison between estimated rPPG signals (red curves) and their corresponding ground truth PPG signals (blue curves).
  • ...and 5 more figures
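
The spatio-temporal maps (STMaps) mentioned in the Figure 2 caption can be pictured with a minimal sketch. It follows the common construction in the rPPG literature, where each frame is split into spatial blocks and the mean color of each block is tracked over time; the grid size, the per-block min-max scaling, and the function name `build_stmap` are assumptions, and the paper's contrastive (C-STMap) and masked (M-STMap) variants are not reproduced here.

```python
import numpy as np

def build_stmap(video, grid=(4, 4)):
    """Turn a (T, H, W, C) video into a (blocks, T, C) map of per-block mean colors.

    Each row is one spatial block followed through time, scaled to [0, 1]
    per block and channel (an assumed normalization).
    """
    video = np.asarray(video, dtype=float)
    t, h, w, c = video.shape
    rows, cols = grid
    bh, bw = h // rows, w // cols
    cropped = video[:, : bh * rows, : bw * cols, :]      # make H and W divisible by the grid
    blocks = cropped.reshape(t, rows, bh, cols, bw, c).mean(axis=(2, 4))  # (T, rows, cols, C)
    stmap = blocks.reshape(t, rows * cols, c).transpose(1, 0, 2)          # (blocks, T, C)
    lo = stmap.min(axis=1, keepdims=True)
    hi = stmap.max(axis=1, keepdims=True)
    return (stmap - lo) / np.maximum(hi - lo, 1e-8)      # guard against flat blocks

# Synthetic stand-in for a face crop: uniform color with a faint 1.2 Hz pulse at 30 fps.
frames, height, width = 60, 16, 16
pulse = 0.05 * np.sin(2 * np.pi * 1.2 * np.arange(frames) / 30.0)
video = 0.5 + pulse[:, None, None, None] * np.ones((1, height, width, 3))
stmap = build_stmap(video, grid=(4, 4))
print(stmap.shape)
```

Laying time along one axis turns the periodic skin color variation into a repeating pattern in a 2D image, which is the kind of input a pre-trained vision encoder accepts.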