Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning
Mohamed Khalil Ben Salah, Philippe Jouvet, Rita Noumeir
TL;DR
This work tackles non-contact vital-sign monitoring in Pediatric Intensive Care Units (PICUs) with a self-supervised rPPG framework tailored to clinical conditions. It combines a VisionMamba-based student, a fixed physiological teacher (PhysMamba), and a novel Adaptive Masking Network that learns, via policy gradient, to occlude informative patches, guided by a curriculum that progresses from clean public data to real PICU videos. Training pairs masked reconstruction with physiological distillation, reaching a final MAE of 3.2 bpm and a Pearson correlation of R = 0.91 while remaining robust to occlusions and domain shifts. The method targets continuous, low-risk, contactless heart-rate monitoring in the PICU, with real-time efficiency and no explicit ROI annotation.
Abstract
Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.
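The probabilistic patch sampling described in the abstract can be illustrated with a minimal sketch. It assumes the controller has already produced one importance score per spatiotemporal patch and that higher scores should be masked more often. The function names, the temperature parameter, the mask ratio, and the example scores are all hypothetical, and the Gumbel-top-k trick used to sample without replacement is a stand-in technique, not necessarily the sampler used in the paper. The policy-gradient training of the controller is not shown.

```python
import math
import random


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-probabilities from raw importance scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_z for s in scaled]


def sample_masked_patches(scores, mask_ratio=0.75, temperature=1.0, rng=None):
    """Sample distinct patch indices to mask, favoring high-importance patches.

    Uses the Gumbel-top-k trick: perturb each log-probability with Gumbel
    noise and keep the k largest, which samples k items without replacement.
    """
    rng = rng or random.Random()
    n_mask = round(mask_ratio * len(scores))
    log_probs = log_softmax(scores, temperature)
    keys = []
    for index, log_p in enumerate(log_probs):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        keys.append((log_p + gumbel, index))
    keys.sort(reverse=True)
    return sorted(index for _, index in keys[:n_mask])


# Hypothetical scores for 8 spatiotemporal patches (not data from the paper).
scores = [2.1, 0.3, 1.7, -0.5, 0.0, 2.4, -1.2, 0.8]
masked = sample_masked_patches(scores, mask_ratio=0.5, rng=random.Random(0))
visible = [i for i in range(len(scores)) if i not in masked]
```

In this sketch, lowering `temperature` concentrates masking on the highest-scoring patches, while raising it approaches uniform random masking, which is one simple way a curriculum could adjust reconstruction difficulty.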
