VidFormer: A novel end-to-end framework fused by 3DCNN and Transformer for Video-based Remote Physiological Measurement

Jiachen Li, Shisheng Guo, Longzhen Tang, Cuolong Cui, Lingjiang Kong, Xiaobo Yang

TL;DR

VidFormer addresses the challenge of robust video-based rPPG by fusing 3D CNNs for local feature extraction with Transformers for global temporal-spatial modeling. It introduces a Stem, Local Convolution Branch (GA-3DCNN and BS-3DCNN), Global Transformer Branch (ST-MHSA and Transformer), a CNN-Transformer Interaction Module (CTIM), and an rPPG Generation Module (RGM). The approach relies on an enhanced skin-reflection model and a dual-path fusion strategy to reconstruct BVP signals, with a loss that combines negative Pearson correlation and Smooth L1. Extensive experiments on UBFC-rPPG, PURE, COHFACE, ECG-fitness, and DEAP show state-of-the-art HR, RF, and HRV performance, along with strong cross-dataset generalization and rigorous ablations. The work provides concrete insights into the roles of ethnicity, makeup, and physical activity on rPPG performance and demonstrates practical potential for non-contact vital sign monitoring across diverse conditions.
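The loss mentioned above, which combines negative Pearson correlation with Smooth L1, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the use of `1 - r` as the negative Pearson term, the Smooth L1 threshold `beta`, and the weights `w_pearson` and `w_l1` are all assumptions made for illustration, because this summary does not state how the two terms are weighted.

```python
import math


def pearson(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    denom = math.sqrt(var_p * var_t)
    # A constant signal has no defined correlation; treat it as uncorrelated.
    return cov / denom if denom > 0 else 0.0


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1: quadratic for errors below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def combined_loss(pred, target, w_pearson=1.0, w_l1=1.0):
    """Weighted sum of (1 - r) and Smooth L1; the weights are hypothetical."""
    return w_pearson * (1.0 - pearson(pred, target)) + w_l1 * smooth_l1(pred, target)


# Toy BVP-like waveform: 90 samples, assumed 30 fps, 1.2 Hz pulse (72 bpm).
bvp = [math.sin(2 * math.pi * 1.2 * i / 30) for i in range(90)]
shifted = [math.sin(2 * math.pi * 1.2 * i / 30 + 0.5) for i in range(90)]

print(combined_loss(bvp, bvp))      # a perfect prediction gives a loss near zero
print(combined_loss(shifted, bvp))  # a phase-shifted prediction is penalized
```

The two terms are complementary: the Pearson term rewards matching waveform shape regardless of scale or offset, while Smooth L1 penalizes amplitude errors without letting outliers dominate.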

Abstract

Remote physiological signal measurement based on facial videos, also known as remote photoplethysmography (rPPG), involves predicting changes in facial vascular blood flow from facial videos. While most deep learning-based methods have achieved good results, they often struggle to balance performance across small- and large-scale datasets due to the inherent limitations of convolutional neural networks (CNNs) and Transformers. In this paper, we introduce VidFormer, a novel end-to-end framework that integrates a 3D convolutional neural network (3DCNN) and a Transformer for rPPG tasks. We first analyze the traditional skin reflection model and then introduce an enhanced model for reconstructing rPPG signals. Based on this improved model, VidFormer uses the 3DCNN and the Transformer to extract local and global features from the input data, respectively. To enhance the spatiotemporal feature extraction capabilities of VidFormer, we incorporate temporal-spatial attention mechanisms tailored to both the 3DCNN and the Transformer. Additionally, we design a module to facilitate information exchange and fusion between the 3DCNN and the Transformer. Our evaluation on five publicly available datasets demonstrates that VidFormer outperforms current state-of-the-art (SOTA) methods. Finally, we discuss the essential role of each VidFormer module and examine the effects of ethnicity, makeup, and exercise on its performance.

Paper Structure

This paper contains 40 sections, 16 equations, 19 figures, 12 tables.

Figures (19)

  • Figure 1: The potential mapping relationship between each frame in the video and the BVP signal.
  • Figure 2: The overall framework of VidFormer. VidFormer leverages 3DCNN and Transformers to extract local and global features from input facial videos, respectively, and facilitates the interaction and fusion of these features. Additionally, VidFormer is designed with a modular structure that enables the efficient formation of a deeper network.
  • Figure 3: The illustration of the GA-3DCNN module. GA-3DCNN incorporates a global attention mechanism to assist BS-3DCNN in focusing on critical spatiotemporal regions within the input video.
  • Figure 4: Schematic diagram of the multi-head attention mechanism in Spatial Attention and Time Attention.
  • Figure 5: The illustration of BS-3DCNN. BS-3DCNN is designed for feature extraction from input data.
  • ...and 14 more figures