Spiking-PhysFormer: Camera-Based Remote Photoplethysmography with Parallel Spike-driven Transformer

Mingxuan Liu; Jiankai Tang; Yongli Chen; Haoxiang Li; Jiahao Qi; Siwei Li; Kegang Wang; Jie Gan; Yuntao Wang; Hong Chen

TL;DR

This paper addresses the high energy cost of camera-based rPPG by introducing Spiking-PhysFormer, a hybrid ANN–SNN model that uses ANN patch embedding, parallel spike-driven transformer blocks, and an ANN predictor head. It combines a simplified spiking self-attention (S3A) mechanism with a parallelized attention pathway to achieve substantial energy savings (e.g., transformer-block energy reduced by ~12.2×) while preserving accuracy across four public datasets and demonstrating cross-dataset generalization. The method includes spike coding/decoding bridges, surrogate-gradient training, and interpretable spike-based attention maps that localize facial regions and pulse-wave peaks. The results show performance competitive with state-of-the-art ANN-based rPPG models at significantly lower power consumption, highlighting potential for energy-efficient edge deployment in remote-health monitoring and telemedicine, albeit with privacy and ethical considerations for camera-based physiological sensing.

Abstract

Artificial neural networks (ANNs) can improve the accuracy with which camera-based remote photoplethysmography (rPPG) measures cardiac activity and physiological signals, such as pulse wave, heart rate, and respiration rate, from facial videos. However, most existing ANN-based methods require substantial computing resources, which poses challenges for effective deployment on mobile devices. Spiking neural networks (SNNs), on the other hand, hold immense potential for energy-efficient deep learning owing to their binary and event-driven architecture. To the best of our knowledge, we are the first to introduce SNNs into the realm of rPPG, proposing a hybrid neural network (HNN) model, the Spiking-PhysFormer, aimed at reducing power consumption. Specifically, the proposed Spiking-PhysFormer consists of an ANN-based patch embedding block, SNN-based transformer blocks, and an ANN-based predictor head. First, to simplify the transformer block while preserving its capacity to aggregate local and global spatio-temporal features, we design a parallel spike transformer block to replace sequential sub-blocks. Additionally, we propose a simplified spiking self-attention mechanism that omits the value parameter without compromising the model's performance. Experiments conducted on four datasets (PURE, UBFC-rPPG, UBFC-Phys, and MMPD) demonstrate that the proposed model achieves a 12.4% reduction in power consumption compared to PhysFormer. Additionally, the power consumption of the transformer block is reduced by a factor of 12.2, while maintaining performance comparable to PhysFormer and other ANN-based models.

Paper Structure (20 sections, 35 equations, 10 figures, 8 tables)

Introduction
Related work
Camera-based remote photoplethysmography
Transformer-based spiking neural networks
Hybrid neural networks
The proposed method
Overall architecture
Data initial preprocessing
Parallel spike-driven transformer
Experimental results
Datasets and performance metrics
Implementation details
Energy consumption analysis
Cross-dataset testing
Spatio-temporal attention map
...and 5 more sections

Figures (10)

Figure 1: (a) Human head anatomy with external and internal carotid arteries (headblood). (b) rPPG pipeline of neural methods (toolbox).
Figure 2: MAE (lower is better) vs. inference energy of different neural methods implemented in 45nm technology (45nm) with an input frame size of 128 × 128. The shaded blue region shows the preferred region. The FLOPs for ANN-based models are derived from PhysBench (physbench) and adjusted based on input size. (a) Results on the PURE dataset (pure) after training on the UBFC-rPPG dataset (ubfc-rppg). (b) Results on the UBFC-Phys dataset (ubfc-phys) after training on the PURE dataset (pure). (c) The computational energy required by the transformer block in the Spiking-PhysFormer is 12.2 times lower than that in the PhysFormer (physformer).
Figure 3: Framework of the Spiking-PhysFormer. It consists of an ANN-based patch embedding (PE) block, several parallel spike-driven transformer blocks, and an ANN-based predictor head. The icon above the arrow between the PE and the Parallel spike-driven transformer blocks represents direct encoding of the output from the PE block. For the ANN-based components of our model, we follow the network structure in PhysFormer (physformer). Additionally, we initialize our model by pretraining PhysFormer and extracting the weights of the PE block as pre-trained parameters.
Figure 4: Comparison of the temporal difference self-attention (TDSA) used in PhysFormer (physformer) and our simplified spiking self-attention (S3A). (a) In TDSA, $Q$, $K$, and $V$ are obtained through linear projections using TDC (tdc) and Conv3D. Since the input $X$ is a floating-point matrix, this involves a significant number of multiplication operations. Furthermore, the subsequent SA operation involves matrix multiplication, specifically requiring $2N^2D$ multiply-and-accumulate operations, where $N$ is the number of tokens and $D$ is the channel dimension. (b) Compared with TDSA, S3A utilizes TDC exclusively for query computation. Additionally, since the input $S$ is a binary spike sequence, the linear operation involved here is limited to addition. For SA computation, S3A employs an element-wise mask (Hadamard product), column summation, and column mask. As a result, only $fND$ accumulate operations are required, where $f$ represents the non-zero ratio of the matrix after applying the mask to $Q$ and $K$. Typically, $f$ is less than 0.06 (Fig. 8).
Figure 5: Example video frames from datasets. (a) PURE (pure); (b) UBFC-rPPG (ubfc-rppg); (c) UBFC-Phys (ubfc-phys); and (d) MMPD (mmpd).
...and 5 more figures
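
The operation counts quoted in the Figure 4 caption can be made concrete with a short sketch. The code below is illustrative only and is not the authors' implementation. It counts the $2N^2D$ multiply-and-accumulate operations of the TDSA attention product against the $fND$ accumulate operations of S3A, and walks through the three S3A steps named in the caption on tiny binary matrices. Several things are assumptions rather than facts from this page: the per-operation energies (4.6 pJ per MAC and 0.9 pJ per accumulate, values commonly quoted for 32-bit floating point at 45nm), the firing threshold, the choice to apply the column mask to $Q$, and all function names.

```python
# Assumed per-operation energies (pJ), commonly quoted for 45nm 32-bit float.
# They are NOT taken from this page; swap in the values you trust.
E_MAC_PJ = 4.6  # one multiply-and-accumulate
E_AC_PJ = 0.9   # one accumulate (addition only)


def tdsa_attention_macs(n_tokens, dim):
    """MACs of the attention matrix products in TDSA: 2 * N^2 * D."""
    return 2 * n_tokens ** 2 * dim


def s3a_attention_acs(n_tokens, dim, nonzero_ratio):
    """Accumulates of the masked spike attention in S3A: f * N * D."""
    return nonzero_ratio * n_tokens * dim


def s3a_sketch(q, k, threshold=1):
    """Toy S3A on binary N x D lists (hypothetical reading of the caption).

    Returns the masked output and the non-zero ratio f of Q * K.
    """
    # Element-wise mask: Hadamard product of two binary matrices is a logical AND.
    masked = [[qi & ki for qi, ki in zip(q_row, k_row)]
              for q_row, k_row in zip(q, k)]
    # Column summation over the token axis.
    col_sum = [sum(col) for col in zip(*masked)]
    # Assumed spiking step: a channel fires if its sum reaches the threshold.
    col_mask = [1 if s >= threshold else 0 for s in col_sum]
    # Column mask, applied to Q here because S3A omits V (an assumption).
    out = [[qi & m for qi, m in zip(q_row, col_mask)] for q_row in q]
    nonzero_ratio = sum(map(sum, masked)) / (len(q) * len(q[0]))
    return out, nonzero_ratio


if __name__ == "__main__":
    n, d, f = 640, 96, 0.06  # hypothetical token count and channel width
    macs = tdsa_attention_macs(n, d)
    acs = s3a_attention_acs(n, d, f)
    print(f"TDSA attention: {macs} MACs, {macs * E_MAC_PJ / 1e6:.2f} uJ")
    print(f"S3A attention:  {acs:.1f} ACs, {acs * E_AC_PJ / 1e6:.6f} uJ")
```

These counts cover only the attention product, not the projections or the rest of the block, so the ratio they give is not expected to match the 12.2× figure reported for the whole transformer block.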
