CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement

Zheng Wu, Yiping Xie, Bo Zhao, Jiguang He, Fei Luo, Ning Deng, Zitong Yu

TL;DR

CardiacMamba tackles non-contact heart-rate estimation by fusing RGB video and RF radar signals through a state-space–informed multimodal framework. The method introduces the Temporal Difference Mamba Module (TDMM), Bidirectional State Space Model (Bi-SSM), and Channel-wise Fast Fourier Transform (CFFT) to extract dynamic temporal features, align modalities bidirectionally, and refine frequency-domain cues for heart-rate periodicity. On EquiPleth, CardiacMamba achieves state-of-the-art accuracy and robustness, reduces skin-tone bias, and remains effective under missing-modality conditions, demonstrating strong potential for fair, real-world healthcare deployment. By integrating Mamba-based cross-modal modeling with frequency-domain fusion, the approach advances rPPG technology toward reliable, scalable remote monitoring.

Abstract

Heart rate (HR) estimation via remote photoplethysmography (rPPG) offers a non-invasive solution for health monitoring. However, traditional single-modality approaches (RGB or Radio Frequency (RF)) struggle to balance robustness and accuracy because of lighting variations, motion artifacts, and skin tone bias. In this paper, we propose CardiacMamba, a multimodal RGB-RF fusion framework that leverages the complementary strengths of both modalities. It introduces the Temporal Difference Mamba Module (TDMM), which captures dynamic changes in RF signals using temporal differences between frames, enhancing the extraction of local and global features. Additionally, CardiacMamba employs a Bidirectional SSM for cross-modal alignment and a Channel-wise Fast Fourier Transform (CFFT) to capture and refine the frequency-domain characteristics of RGB and RF signals, improving heart rate estimation accuracy and periodicity detection. Extensive experiments on the EquiPleth dataset demonstrate state-of-the-art performance, with marked improvements in accuracy and robustness. CardiacMamba significantly mitigates skin tone bias, reducing performance disparities across demographic groups, and remains resilient under missing-modality scenarios. By addressing critical challenges in fairness, adaptability, and precision, the framework advances rPPG technology toward reliable real-world deployment in healthcare. The code is available at: https://github.com/WuZheng42/CardiacMamba.
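
The abstract names two signal-processing ideas: temporal differences between frames and a channel-wise Fourier transform. As a rough illustration only, the sketch below applies both to a plain multi-channel time series and picks the dominant frequency as the heart rate. It is not the authors' implementation: the learned Mamba/SSM layers are omitted entirely, and the function names, the 42-180 bpm search band, and the 1 bpm grid are assumptions made for this example.

```python
import cmath
import math


def temporal_difference(frames):
    """First-order difference along time: frames[t+1] - frames[t], per channel.

    `frames` is a list of T rows, each row holding one value per channel.
    """
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(frames, frames[1:])]


def channel_spectrum(signal, fs, freqs_hz):
    """Magnitude of the discrete Fourier sum of one channel at chosen frequencies."""
    mean = sum(signal) / len(signal)  # remove the DC offset before transforming
    spectrum = []
    for f in freqs_hz:
        acc = sum((x - mean) * cmath.exp(-2j * math.pi * f * n / fs)
                  for n, x in enumerate(signal))
        spectrum.append(abs(acc))
    return spectrum


def estimate_hr_bpm(frames, fs, band_bpm=(42, 180), step_bpm=1):
    """Estimate heart rate as the strongest frequency, summed over channels.

    The band and grid step are illustrative assumptions, not values from the paper.
    """
    diffs = temporal_difference(frames)
    bpm_grid = list(range(band_bpm[0], band_bpm[1] + 1, step_bpm))
    freqs_hz = [bpm / 60.0 for bpm in bpm_grid]

    total = [0.0] * len(freqs_hz)
    for c in range(len(diffs[0])):
        channel = [row[c] for row in diffs]
        spec = channel_spectrum(channel, fs, freqs_hz)
        total = [t + s for t, s in zip(total, spec)]

    best = max(range(len(total)), key=total.__getitem__)
    return bpm_grid[best]


if __name__ == "__main__":
    fs = 30.0  # assumed sampling rate in frames per second
    # Synthetic two-channel signal pulsing at 1.2 Hz, with a slow linear drift
    # on the first channel that the temporal difference turns into a constant.
    frames = [[math.sin(2 * math.pi * 1.2 * n / fs) + 0.01 * n,
               0.5 * math.cos(2 * math.pi * 1.2 * n / fs)]
              for n in range(300)]
    print(estimate_hr_bpm(frames, fs))
```

Differencing before the transform suppresses slowly varying components such as drift, and summing magnitudes across channels lets every channel vote on the same periodicity. In the framework described above, these hand-written steps would correspond to learned modules rather than fixed formulas.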

Paper Structure

This paper contains 27 sections, 23 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of deep learning methods for rPPG learning. (a) RGB-only Method: Training with only RGB data collected by the camera. (b) RF-only Method: Training with only RF data collected by the radar. (c) RGB-RF Method: Training with both RGB and RF data.
  • Figure 2: The overall architecture of CardiacMamba. It consists of three stages: Dual-level Feature Extraction and Alignment, Bidirectional Feature Interaction, and Bidirectional Feature Fusion.
  • Figure 3: Temporal Difference Mamba Module (TDMM) for extracting dynamic temporal features and global features from RF signals.
  • Figure 4: Channel-wise Fast Fourier Transform (CFFT) is used to extract frequency domain features of RGB and RF modalities.
  • Figure 5: Visual representations of human face and radar spectrum features. (a) The human face image used in the analysis. (b) Feature heat map of the human face, illustrating the key regions of interest. (c) Radar spectrum diagram representing the frequency information of the signal. (d) Feature heat map of the radar spectrum diagram, highlighting the relevant features for analysis.
  • ...and 3 more figures