Toward Motion Robustness: A masked attention regularization framework in remote photoplethysmography

Pengfei Zhao; Qigong Sun; Xiaolin Tian; Yige Yang; Shuo Tao; Jie Cheng; Jiantong Chen

TL;DR

This study tackles inaccurate ROI localization and motion sensitivity in facial video-based rPPG by introducing MAR-rPPG, a framework that couples masked attention regularization with an enhanced rPPG expert aggregation (EREA) backbone. The approach enforces spatiotemporal attention consistency (including flip semantic consistency) and uses masking to prevent overfitting to erroneous ROIs, while EREA processes the facial regions separately and aggregates the results into a robust rPPG signal with attention maps. Two loss components, a regression loss and an attention consistency loss, drive accurate HR estimation and stable attention maps, yielding strong cross-dataset generalization and motion-robust performance across the PURE, UBFC-rPPG, and MMPD datasets. The method achieves near-perfect correlation on ideal datasets and remains robust under challenging real-world conditions, suggesting practical potential for non-contact vital sign monitoring with lightweight MediaPipe preprocessing.

Abstract

There has been growing interest in facial video-based remote photoplethysmography (rPPG) measurement, with a focus on assessing vital signs such as heart rate and heart rate variability. Despite previous efforts on static datasets, existing approaches have been hindered by inaccurate region of interest (ROI) localization and motion artifacts, and have shown limited generalization in real-world scenarios. To address these challenges, we propose a novel masked attention regularization (MAR-rPPG) framework that mitigates the impact of inaccurate ROI localization and complex motion artifacts. Specifically, our approach first integrates a masked attention regularization mechanism into the rPPG field to capture the visual semantic consistency of facial clips, and it also employs a masking technique to prevent the model from overfitting to inaccurate ROIs, which would otherwise degrade its performance. Furthermore, we propose an enhanced rPPG expert aggregation (EREA) network as the backbone to obtain rPPG signals and attention maps simultaneously. The EREA network can discriminate the divergent attention of different facial areas while retaining the consistency of spatiotemporal attention maps. For motion robustness, the simple open-source detector MediaPipe is sufficient for data preprocessing, owing to the framework's strong rPPG signal extraction and attention regularization. Extensive experiments on three benchmark datasets (UBFC-rPPG, PURE, and MMPD) show that our proposed method outperforms recent state-of-the-art works by a considerable margin.
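
The flip semantic consistency idea in the abstract can be illustrated with a small sketch: if a clip is mirrored horizontally, the attention map should mirror with it. The code below is only an illustration under stated assumptions, not the authors' loss. The function names `hflip` and `flip_consistency_loss`, the use of a mean squared difference, and the choice to apply the mask to the loss terms are all hypothetical, since this page does not give the exact formulation.

```python
def hflip(attn):
    """Mirror a 2-D attention map (a list of rows) left to right."""
    return [row[::-1] for row in attn]


def flip_consistency_loss(attn_normal, attn_flipped, mask=None):
    """Mean squared difference between the attention map of the normal clip
    and the mirrored-back attention map of the horizontally flipped clip.

    `mask` is an optional map of the same shape; cells with a falsy mask
    value are skipped. This is an assumed stand-in for the masking step,
    not the formulation used in the paper.
    """
    mirrored = hflip(attn_flipped)  # undo the flip so the two maps align
    total, count = 0.0, 0
    for r, row in enumerate(attn_normal):
        for c, value in enumerate(row):
            if mask is not None and not mask[r][c]:
                continue  # masked-out cell contributes nothing
            total += (value - mirrored[r][c]) ** 2
            count += 1
    return total / count if count else 0.0


# Toy 2x4 attention map that attends to the left side of the face.
attn = [[0.9, 0.1, 0.0, 0.0],
        [0.8, 0.2, 0.0, 0.0]]

consistent = hflip(attn)  # a model whose attention mirrors with the input
inconsistent = attn       # a model whose attention ignores the flip

loss_consistent = flip_consistency_loss(attn, consistent)
loss_inconsistent = flip_consistency_loss(attn, inconsistent)

# Keep only the two central columns when computing the loss.
center_mask = [[0, 1, 1, 0],
               [0, 1, 1, 0]]
loss_masked = flip_consistency_loss(attn, inconsistent, mask=center_mask)
```

In this toy setup the consistent model incurs zero loss, while the model that keeps attending to the same image side after the flip is penalized. That is the behavior a consistency regularizer is meant to encourage.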

Paper Structure (19 sections, 6 equations, 6 figures, 3 tables)

Introduction
Related works
Methodology
Masked attention regularization
Enhanced rPPG expert aggregation network
Network optimization
Regression loss
Attention consistency loss
Experiments
Datasets
Implementation Details
Metrics and evaluation
Results
HR evaluation
Cross dataset evaluation
...and 4 more sections

Figures (6)

Figure 1: Model degradation due to inaccurate and inconsistent ROI localization on the PURE dataset under the flip semantic consistency strategy. In (a), 'n' and 'f' denote normal data and horizontally flipped data, respectively; e.g., 'n/f' denotes a model trained on normal input and tested on flipped data. (b) shows inconsistent attention map samples whose attention regions are supposed to mirror each other.
Figure 2: Overview of the proposed MAR-rPPG. MAR-rPPG consists of one Encoder and one EREA network, with weights shared across the two inputs. First, the encoder encodes the input video into a feature tensor. Next, this feature tensor is sent to the EREA network, which outputs an rPPG signal and attention maps. Specifically, within EREA the feature tensor is divided into four equal parts, and each part is passed to its corresponding Expert $E$ module, which generates attention maps and extracts an rPPG signal corresponding to one of the facial regions. Finally, a Gate module $G$ aggregates the four rPPG signals into one rPPG prediction.
Figure 3: The Bland-Altman plot (a) and scatter plot (b) show the difference between estimated HR and ground truth HR in the cross-dataset evaluation (PURE $\rightarrow$ UBFC-rPPG).
Figure 4: The Bland-Altman plot (a) and scatter plot (b) show the difference between estimated HR and ground truth HR in the cross-dataset evaluation (UBFC-rPPG $\rightarrow$ PURE).
Figure 5: The estimated rPPG signals on talking and walking motion samples of the MMPD dataset.
...and 1 more figures
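
The data flow described in the Figure 2 caption (split the features into four equal parts, run one expert per part, then let a gate merge the four outputs) can be sketched as follows. This is a minimal illustration, not the EREA implementation. The split axis, the internals of `expert` (a softmax attention over its part) and of `gate` (a softmax-weighted sum with fixed scores), and all function names are assumptions introduced for this example.

```python
import math


def split_four(frame_features):
    """Split one frame's flat feature list into four equal, contiguous parts."""
    size, remainder = divmod(len(frame_features), 4)
    if remainder:
        raise ValueError("feature length must be divisible by 4")
    return [frame_features[i * size:(i + 1) * size] for i in range(4)]


def expert(part):
    """Placeholder expert (assumed): softmax attention over its part, and the
    attention-weighted sum of the part as that region's signal sample."""
    exps = [math.exp(v) for v in part]
    norm = sum(exps)
    attention = [e / norm for e in exps]
    sample = sum(a * v for a, v in zip(attention, part))
    return sample, attention


def gate(samples, scores):
    """Placeholder gate (assumed): softmax over per-expert scores, then a
    weighted sum of the four expert samples."""
    exps = [math.exp(s) for s in scores]
    norm = sum(exps)
    return sum(e / norm * x for e, x in zip(exps, samples))


def erea_forward(clip_features, gate_scores):
    """Return one aggregated signal sample per frame plus per-expert attention."""
    signal, attention_maps = [], []
    for frame in clip_features:
        outputs = [expert(part) for part in split_four(frame)]
        signal.append(gate([s for s, _ in outputs], gate_scores))
        attention_maps.append([a for _, a in outputs])
    return signal, attention_maps


# Toy clip: 3 frames, 8 features per frame (2 features per expert).
clip = [[0.0] * 8, [1.0] * 8, [0.5] * 8]
signal, attention_maps = erea_forward(clip, gate_scores=[0.0, 0.0, 0.0, 0.0])
```

With equal gate scores every expert contributes equally. In a trained model the gate would presumably learn to down-weight regions corrupted by motion, but that behavior is not shown here.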
