PHASE-Net: Physics-Grounded Harmonic Attention System for Efficient Remote Photoplethysmography Measurement

Bo Zhao, Dan Guo, Junzhe Cao, Yong Xu, Tao Tan, Yue Sun, Bochao Zou, Jie Zhang, Zitong Yu

TL;DR

PHASE-Net tackles the robustness and interpretability gap in remote photoplethysmography (rPPG) by deriving a physics-based dynamical model from hemodynamics and proving its equivalence to a causal convolution. This leads to a principled Temporal Convolutional Network (TCN) core, supported by a Zero-FLOPs Axial Swapper (ZAS) and an Adaptive Spatial Filter (ASF) for efficient, noise-robust signal extraction. The model achieves state-of-the-art accuracy with only ~0.29M parameters and strong cross-domain generalization across UBFC, PURE, BUAA, and MMPD, validated through extensive intra-/inter-dataset experiments and ablations. The work provides a deployable, interpretable framework with potential extensions to multi-task physiological sensing and other video-based biomedical applications.

Abstract

Remote photoplethysmography (rPPG) measurement enables non-contact physiological monitoring but suffers from accuracy degradation under head motion and illumination changes. Existing deep learning methods are mostly heuristic and lack theoretical grounding, which limits robustness and interpretability. In this work, we propose a physics-informed rPPG paradigm derived from the Navier-Stokes equations of hemodynamics, showing that the pulse signal follows a second-order dynamical system whose discrete solution naturally leads to a causal convolution. This provides a theoretical justification for using a Temporal Convolutional Network (TCN). Based on this principle, we design PHASE-Net, a lightweight model with three key components: (1) Zero-FLOPs Axial Swapper module, which swaps or transposes a few spatial channels to mix distant facial regions and enhance cross-region feature interaction without breaking temporal order; (2) Adaptive Spatial Filter, which learns a soft spatial mask per frame to highlight signal-rich areas and suppress noise; and (3) Gated TCN, a causal dilated TCN with gating that models long-range temporal dynamics for accurate pulse recovery. Extensive experiments demonstrate that PHASE-Net achieves state-of-the-art performance with strong efficiency, offering a theoretically grounded and deployment-ready rPPG solution.
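
The Zero-FLOPs Axial Swapper in component (1) can be illustrated with a minimal Python sketch. It assumes features are stored as nested lists indexed `x[c][t][h][w]`, that feature maps are square, and that the "swap" is a plain transpose of the two spatial axes on the first few channels. The names `axial_swap` and `num_swapped` are hypothetical, and the paper's actual module may partition channels and blocks differently.

```python
from copy import deepcopy


def axial_swap(x, num_swapped):
    """Transpose H and W for the first `num_swapped` channels of x[c][t][h][w].

    Only indexing is involved: no multiplications or additions are performed,
    and the time index t is never touched, so temporal order is preserved.
    Assumption (not from the paper): square feature maps, H == W.
    """
    out = deepcopy(x)
    for c in range(min(num_swapped, len(x))):
        for t, frame in enumerate(x[c]):
            # zip(*frame) iterates over the columns of `frame`, i.e. its transpose
            out[c][t] = [list(col) for col in zip(*frame)]
    return out


# Toy tensor: 2 channels, 2 frames, 2x2 spatial grid.
x = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],          # channel 0
    [[[9, 10], [11, 12]], [[13, 14], [15, 16]]],   # channel 1
]
y = axial_swap(x, num_swapped=1)
```

Because the operation only permutes existing values, applying it twice restores the input, and channels outside the swapped group pass through unchanged.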

Paper Structure

This paper contains 45 sections, 6 theorems, 67 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1

The solution $z_t$ of the LTI system in Eq. (ssm_final) can be expressed as a causal convolution of all past inputs.
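
The general fact behind this proposition can be checked numerically. The sketch below assumes a generic discrete state-space form, $h_t = A h_{t-1} + B u_t$, $z_t = C h_t$ with $h_{-1} = 0$, for which unrolling gives $z_t = \sum_{k=0}^{t} C A^{k} B\, u_{t-k}$. The matrices, the input sequence, and the indexing convention are illustrative assumptions; the paper's own equation may be written differently.

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def simulate(A, B, C, u):
    """Run h_t = A h_{t-1} + B u_t, z_t = C h_t with zero initial state."""
    h = [0.0] * len(B)
    z = []
    for u_t in u:
        Ah = matvec(A, h)
        h = [ah + b * u_t for ah, b in zip(Ah, B)]
        z.append(dot(C, h))
    return z


def kernel(A, B, C, length):
    """Impulse response K_k = C A^k B for k = 0 .. length-1."""
    v = list(B)
    K = []
    for _ in range(length):
        K.append(dot(C, v))
        v = matvec(A, v)
    return K


def causal_conv(K, u):
    """z_t = sum_{k=0}^{t} K_k u_{t-k}; only current and past inputs are read."""
    return [sum(K[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]


# Hypothetical stable two-state system (eigenvalue modulus about 0.94).
A = [[0.8, -0.5], [0.5, 0.8]]
B = [1.0, 0.0]
C = [0.0, 1.0]
u = [1.0, 0.5, -0.3, 0.0, 2.0, -1.0]

z_rec = simulate(A, B, C, u)
z_conv = causal_conv(kernel(A, B, C, len(u)), u)
```

The two sequences agree up to floating-point rounding, which is the sense in which a recurrent LTI model and a causal convolution are interchangeable.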

Figures (7)

  • Figure 1: Overview of PHASE-Net. The Vision Encoder comprises three Efficient Spatio-Temporal Blocks (ESTBlocks) that extract spatio-temporal features from the video input. These features are fed into an Adaptive Spatial Filter module, which computes filtered features via convolution layers and differential operations. The filtered features are then processed by a GTCN block, which uses dual-path Temporal Convolutional Networks with tanh and sigmoid gates for fusion. The figure also shows the internals of the ESTBlock, including the Zero-FLOPs Axial Swapper (ZAS), which swaps spatial/temporal axes without adding FLOPs.
  • Figure 2: Comparison of different ablation studies.
  • Figure 3: Ablation over ZAS block sizes $b$.
  • Figure 4: Ablation over ZAS channel groups $p_c$.
  • Figure 5: MAE (bpm) of PHASE-Net, PhysNet, and RhythmFormer under four lighting conditions: LED-Low, LED-High, Incandescent, and Nature. Lower is better.
  • ...and 2 more figures
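
The dual-path gating mentioned in the Figure 1 caption can be sketched on a 1-D signal. The example assumes a WaveNet-style fusion, `tanh(filter path) * sigmoid(gate path)`, with each path being a causal dilated convolution. The weights, the dilation, and the function names are hypothetical, and the paper's GTCN block may fuse the two paths differently.

```python
import math


def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], treating samples before t=0 as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w_k in enumerate(w):
            idx = t - k * dilation
            if idx >= 0:  # never read future or out-of-range samples
                acc += w_k * x[idx]
        y.append(acc)
    return y


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_tcn_layer(x, w_filter, w_gate, dilation):
    """Fuse two causal paths: tanh supplies the content, sigmoid the gate in (0, 1)."""
    f = causal_dilated_conv(x, w_filter, dilation)
    g = causal_dilated_conv(x, w_gate, dilation)
    return [math.tanh(a) * sigmoid(b) for a, b in zip(f, g)]


W_FILTER = [0.5, 0.3, 0.2]   # hypothetical weights
W_GATE = [1.0, -1.0, 0.5]    # hypothetical weights
DILATION = 2                 # receptive field = (3 - 1) * 2 + 1 = 5 samples

x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
y = gated_tcn_layer(x, W_FILTER, W_GATE, DILATION)

# Causality check: perturbing sample 6 must leave outputs 0..5 unchanged.
x_future = list(x)
x_future[6] = 5.0
y_future = gated_tcn_layer(x_future, W_FILTER, W_GATE, DILATION)
```

Stacking such layers with increasing dilation enlarges the receptive field while each output still depends only on current and past samples.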

Theorems & Definitions (8)

  • Proposition 1: Equivalence to Causal Convolution
  • Proposition 2: FIR Approximation
  • Proposition 3: Self-inversion
  • Proposition 4: Energy preservation and 1-Lipschitz
  • Proposition 5: Equivalence to Causal Convolution
  • proof
  • Proposition 6: FIR Approximation
  • proof
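
The idea behind an FIR approximation (Propositions 2 and 6) can be illustrated without the paper's exact statement, which is not reproduced here. The sketch assumes a stable second-order recursion $z_t = a_1 z_{t-1} + a_2 z_{t-2} + u_t$ with hypothetical coefficients. Because the impulse response of a stable system decays geometrically, keeping only its first few taps yields a finite causal kernel whose error shrinks as the kernel grows.

```python
def iir_second_order(u, a1, a2):
    """Exact recursion z_t = a1*z_{t-1} + a2*z_{t-2} + u_t with zero initial state."""
    z = []
    for t, u_t in enumerate(u):
        z1 = z[t - 1] if t >= 1 else 0.0
        z2 = z[t - 2] if t >= 2 else 0.0
        z.append(a1 * z1 + a2 * z2 + u_t)
    return z


def fir_truncated(u, a1, a2, taps):
    """Approximate the same system using only the first `taps` impulse-response values."""
    impulse = [1.0] + [0.0] * (taps - 1)
    k = iir_second_order(impulse, a1, a2)  # truncated kernel
    return [
        sum(k[j] * u[t - j] for j in range(min(taps, t + 1)))
        for t in range(len(u))
    ]


# Hypothetical coefficients: poles at 0.6 +/- 0.6i, modulus about 0.85 (stable).
a1, a2 = 1.2, -0.72
u = [1.0 if t % 5 == 0 else -0.2 for t in range(80)]
exact = iir_second_order(u, a1, a2)


def max_error(taps):
    """Largest absolute deviation between the exact and truncated outputs."""
    approx = fir_truncated(u, a1, a2, taps)
    return max(abs(e - a) for e, a in zip(exact, approx))
```

With these values, `max_error(4)` is large while `max_error(64)` is far smaller, which is the behaviour that motivates replacing the recurrence with a finite convolution kernel.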