PulmoFusion: Advancing Pulmonary Health with Efficient Multi-Modal Fusion

Ahmed Sharshar, Yasser Attia, Mohammad Yaqub, Mohsen Guizani

TL;DR

PulmoFusion tackles the challenge of precise remote spirometry by fusing RGB or thermal video data with patient metadata through energy-efficient Spiking Neural Networks and CNN-based backbones. The approach uses a Multi-Head Attention Layer to fuse video-derived spikes with metadata for both regression of PEF and classification/regression of FEV1/FVC, achieving state-of-the-art performance while emphasizing low-resource efficiency. It demonstrates strong thermal-imaging advantages and fast inference, supporting potential deployment in low-resource settings, though it acknowledges limitations from a small cohort and reliance on manually segmented breathing cycles. The work contributes a novel multimodal framework, a publicly available codebase, and a dataset to accelerate research in non-invasive pulmonary health monitoring.

Abstract

Traditional remote spirometry lacks the precision required for effective pulmonary monitoring. We present a novel, non-invasive approach using multimodal predictive models that integrate RGB or thermal video data with patient metadata. Our method leverages energy-efficient Spiking Neural Networks (SNNs) for the regression of Peak Expiratory Flow (PEF) and classification of Forced Expiratory Volume (FEV1) and Forced Vital Capacity (FVC), using lightweight CNNs to overcome SNN limitations in regression tasks. Multimodal data integration is improved with a Multi-Head Attention Layer, and we employ K-Fold validation and ensemble learning to boost robustness. Using thermal data, our SNN models achieve 92% accuracy on a breathing-cycle basis and 99.5% patient-wise. PEF regression models attain Relative RMSEs of 0.11 (thermal) and 0.26 (RGB), with an MAE of 4.52% for FEV1/FVC predictions, establishing state-of-the-art performance. Code and dataset are available at https://github.com/ahmed-sharshar/RespiroDynamics.git
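
The abstract reports a Relative RMSE and both per-cycle and patient-wise accuracy, but this page does not define how either is computed. The sketch below shows one plausible reading, under two stated assumptions: Relative RMSE is taken as RMSE divided by the mean of the reference values, and patient-wise labels are obtained by majority vote over a patient's breathing-cycle predictions. The function names `relative_rmse` and `patient_wise_labels` are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict

import numpy as np


def relative_rmse(y_true, y_pred):
    """RMSE normalized by the mean reference value (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)


def patient_wise_labels(patient_ids, cycle_preds):
    """Majority vote over per-cycle binary predictions (assumed aggregation).

    Ties are resolved in favor of label 1.
    """
    votes = defaultdict(list)
    for pid, pred in zip(patient_ids, cycle_preds):
        votes[pid].append(int(pred))
    return {pid: int(2 * sum(v) >= len(v)) for pid, v in votes.items()}


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    print(relative_rmse([100.0, 200.0], [110.0, 190.0]))
    print(patient_wise_labels(["a", "a", "a", "b"], [1, 0, 1, 0]))
```

Aggregating several cycle-level predictions per patient is one way a patient-wise score can exceed the per-cycle score, since isolated cycle errors are outvoted.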

Paper Structure

This paper contains 10 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: PulmoFusion SNN Architecture: Processes thermal/RGB videos and metadata by encoding them into spikes. Video spikes feed into a Spiking CNN, undergo max pooling, and pass through a fully connected (FC) layer. The video features and the features from the metadata's FC layer are concatenated and forwarded to the classifier, which labels the video as normal if FEV1/FVC $\geq$ 70% and abnormal otherwise.
  • Figure 2: PulmoFusion CNN Architecture: Processes thermal/RGB videos using the X3D model, which analyzes the input videos as packets and outputs features. Metadata features are extracted using an FC layer. The metadata and video features are then combined by the attention layer and passed to an FC layer, which performs either regression of the PEF and FEV1/FVC values or classification.
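
The two captions above describe a labeling rule (normal if FEV1/FVC is at least 70%) and an attention-based fusion of video and metadata features. The following NumPy sketch illustrates both ideas in a minimal form. It is not the authors' implementation: treating each modality as a single token, applying multi-head self-attention across the two tokens, and mean-pooling the result are assumptions made for illustration, and `label_from_ratio`, `init_params`, and `multi_head_fuse` are hypothetical names.

```python
import numpy as np


def label_from_ratio(fev1_fvc_percent, threshold=70.0):
    """Labeling rule from the Figure 1 caption: normal if FEV1/FVC >= 70%."""
    return "normal" if fev1_fvc_percent >= threshold else "abnormal"


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, seed=0):
    """Random projection matrices, standing in for learned weights."""
    rng = np.random.default_rng(seed)
    return {name: rng.normal(scale=d ** -0.5, size=(d, d))
            for name in ("Wq", "Wk", "Wv", "Wo")}


def multi_head_fuse(video_feat, meta_feat, params, num_heads=2):
    """Fuse two modality vectors with multi-head self-attention.

    Assumption: each modality is one token of equal width d, and the
    attended tokens are mean-pooled into a single fused vector.
    """
    tokens = np.stack([video_feat, meta_feat])            # (2, d)
    n, d = tokens.shape
    assert d % num_heads == 0, "d must be divisible by num_heads"
    head_dim = d // num_heads

    def split(x):
        # (n, d) -> (heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split(tokens @ params["Wq"])
    k = split(tokens @ params["Wk"])
    v = split(tokens @ params["Wv"])

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                        # (heads, n, head_dim)

    merged = heads.transpose(1, 0, 2).reshape(n, d)         # (n, d)
    return (merged @ params["Wo"]).mean(axis=0)             # (d,)


if __name__ == "__main__":
    params = init_params(8)
    fused = multi_head_fuse(np.ones(8), np.zeros(8), params)
    print(fused.shape, label_from_ratio(72.5))
```

In a full model, the fused vector would feed the final FC layer for regression or classification; that head is omitted here to keep the sketch short.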