PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li

TL;DR

PhysioWave introduces a learnable wavelet front-end paired with a Transformer backbone to capture multi-scale time-frequency structures in physiological signals. Key innovations include an Adaptive Wavelet Selector, Frequency-guided Masking, and a Cross-Scale CAFFN. Together, these enable robust single- and multi-modal biosignal representations and efficient linear-probing fusion for EEG/EMG/ECG tasks. Large-scale pretraining on ECG and EMG data yields state-of-the-art results on downstream ECG/EMG benchmarks. When fused with EEG encoders, it also improves performance on multi-modal emotion and driving-behavior tasks. The framework generalizes well across modalities (ECG, EMG, EEG, PPG) and hardware scales, highlighting practical potential for wearable health monitoring and clinical diagnostics. It maintains interpretability through Grad-CAM-like analyses of the learned features.
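The multi-scale decomposition behind the learnable wavelet front-end can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the paper.

- Fixed Haar filters stand in for the learnable low-pass/high-pass pair, which the paper initializes from standard wavelets such as 'db6' or 'sym4'.
- Upsampling back to the input resolution is nearest-neighbour repetition.
- The gate is a convex combination G*up + (1 - G)*x. It is driven by a hypothetical scalar logit rather than the per-channel multi-head-attention gate.

The function names `analysis_step`, `wavedec`, `upsample`, and `soft_gate` are illustrative only.

```python
import numpy as np

# Haar analysis filters: a stand-in (assumption) for PhysioWave's learnable
# filters, which the paper initializes from wavelets such as 'db6' or 'sym4'.
HAAR_LO = np.array([1.0, 1.0]) / np.sqrt(2.0)
HAAR_HI = np.array([1.0, -1.0]) / np.sqrt(2.0)


def analysis_step(x, lo=HAAR_LO, hi=HAAR_HI):
    """One decomposition level: filter with the low/high-pass pair, then downsample by 2.

    No boundary padding is applied ('valid' correlation); this is exact for
    length-2 Haar filters on even-length inputs.
    """
    approx = np.correlate(x, lo, mode="valid")[::2]
    detail = np.correlate(x, hi, mode="valid")[::2]
    return approx, detail


def wavedec(x, levels):
    """Recursive decomposition: each level re-splits the previous low-pass band.

    Returns [d_1, ..., d_L, a_L], from finest detail to coarsest approximation.
    """
    bands, approx = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = analysis_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


def upsample(band, length):
    """Nearest-neighbour upsampling of a subband back to the input length (assumption)."""
    factor = length // len(band)
    return np.repeat(band, factor)


def soft_gate(x, up, logit):
    """Convex mix of original and upsampled signals with a gate G in [0, 1].

    In the paper G is learned per channel with multi-head attention; here it
    is simply the sigmoid of a hypothetical scalar logit.
    """
    g = 1.0 / (1.0 + np.exp(-logit))
    return g * up + (1.0 - g) * x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 64, endpoint=False)
    x = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.standard_normal(t.size)
    bands = wavedec(x, levels=3)
    print([len(b) for b in bands])  # → [32, 16, 8, 8]
    mixed = soft_gate(x, upsample(bands[0], x.size), logit=0.0)  # G = 0.5
    print(mixed.shape)  # → (64,)
```

The Haar pair is orthonormal, so the subband energies sum exactly to the input energy. This makes a convenient sanity check when substituting another filter pair.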

Abstract

Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. They are also strongly non-stationary, with sharp peaks and abrupt changes that evolve continuously. This makes them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach to physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Building on this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time. They achieve superior performance and set new baselines on downstream tasks. In addition, a unified multi-modal framework is constructed by integrating a pretrained EEG model. In this framework, each modality is processed by its dedicated branch, and the branches are combined via learnable weighted fusion. This design addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, and it outperforms existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for the analysis of diverse physiological signals. The multi-modal design points toward next-generation physiological signal processing, with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications. Code and data are available at: github.com/ForeverBlue816/PhysioWave
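The learnable weighted fusion of per-modality branches can be sketched as a softmax-normalized weighted sum. This fusion form is an assumption, not the authors' implementation. The sketch presumes that each branch already outputs an embedding of a shared dimension `d`, for example a pretrained EEG encoder alongside a PhysioWave ECG or EMG encoder. It treats one scalar logit per modality as the learnable parameter. `weighted_fusion` and the logit dictionary are hypothetical names.

```python
import numpy as np


def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - np.max(v))
    return e / e.sum()


def weighted_fusion(embeddings, logits):
    """Fuse per-modality embeddings with softmax-normalized learnable weights.

    embeddings: dict mapping modality name -> (d,) embedding from its branch
    logits:     dict mapping modality name -> scalar learnable logit (hypothetical)
    Returns the fused (d,) vector and the per-modality weights.
    """
    names = sorted(embeddings)  # fixed order so weights line up with embeddings
    w = softmax(np.array([logits[n] for n in names], dtype=float))
    stacked = np.stack([np.asarray(embeddings[n], dtype=float) for n in names])  # (M, d)
    fused = (w[:, None] * stacked).sum(axis=0)  # broadcast (M, 1) * (M, d), sum over M
    return fused, dict(zip(names, w))


if __name__ == "__main__":
    emb = {"eeg": np.array([1.0, 0.0, 2.0]), "ecg": np.array([0.0, 1.0, 2.0])}
    fused, weights = weighted_fusion(emb, {"eeg": 0.0, "ecg": 0.0})
    print(fused, weights)  # equal logits give the plain average of the embeddings
```

Softmax normalization keeps the weights positive and summing to one, so the fused vector stays on the scale of the branch embeddings. In a linear-probing setup with frozen encoders, the logits (and a classifier on top) would be the main trainable parameters.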

Paper Structure

This paper contains 87 sections, 15 equations, 17 figures, 17 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Model pretraining pipeline. The pipeline begins by initializing a set of standard wavelet functions (e.g., 'db6', 'sym4'), from which learnable low-pass and high-pass filters are generated. These filters are then used for wavelet decomposition to obtain multi-scale frequency-band representations. The decomposed features are processed into spatio-temporal patches, with importance scores computed using FFT-based spectral energy. High-scoring patches are masked and passed through Transformer layers, followed by a lightweight decoder for patch reconstruction.
  • Figure 2: Analysis and soft gating process. The learnable wavelet front-end performs multi-resolution analysis by filtering and downsampling the input signal at each stage, preserving both low- and high-frequency components. At the first level ($\ell = 0$), the signal is decomposed into low-pass and high-pass components. This process recurses for $\ell = 1, \dots, L - 1$, applying downsampling at each level. After decomposition, the subbands are upsampled to the original resolution, and an adaptive gate $G^{(\ell)}_c \in [0, 1]$ is learned for each channel using multi-head attention. The gate dynamically combines the original and upsampled signals, facilitating fine-scale detail insertion.
  • Figure 3: Cross-Scale CAFFN: This module refines multi-resolution features using convolution, channel aggregation, and self-attention.
  • Figure 4: Multi-modal framework: Classification of driving behaviors in the MPDB dataset.
  • Figure 5: Multimodal classification performance.
  • ...and 12 more figures
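The Figure 1 caption describes frequency-guided masking: FFT-based spectral-energy scores are computed for each patch, and the highest-scoring patches are masked. That step can be sketched as follows, under several assumptions not confirmed by the source.

- Patches are non-overlapping temporal windows per channel.
- The score is the total one-sided power of `numpy.fft.rfft` over the patch.
- Masking is a deterministic top-k by score, with no added randomness.

The sketch omits two later stages: replacing masked patches with a mask token, and the Transformer/decoder reconstruction. `patchify`, `spectral_energy`, and `frequency_guided_mask` are hypothetical names.

```python
import numpy as np


def patchify(x, patch_len):
    """Split a (channels, time) signal into non-overlapping temporal patches.

    Returns shape (channels, n_patches, patch_len); trailing samples that do
    not fill a whole patch are dropped.
    """
    c, t = x.shape
    n = t // patch_len
    return x[:, : n * patch_len].reshape(c, n, patch_len)


def spectral_energy(patches):
    """Importance score per patch: total one-sided FFT power along the time axis."""
    spec = np.fft.rfft(patches, axis=-1)
    return np.sum(np.abs(spec) ** 2, axis=-1)


def frequency_guided_mask(x, patch_len, mask_ratio):
    """Boolean mask of shape (channels, n_patches); True marks a masked patch.

    The top `mask_ratio` fraction of patches by spectral energy is masked.
    """
    scores = spectral_energy(patchify(x, patch_len)).ravel()
    k = int(round(mask_ratio * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:  # guard: argsort(...)[-0:] would select every patch
        mask[np.argsort(scores)[-k:]] = True
    return mask.reshape(x.shape[0], -1)


if __name__ == "__main__":
    x = np.zeros((2, 32))
    x[1, 16:24] = np.sin(np.linspace(0, 2 * np.pi, 8))  # a single high-energy burst
    print(frequency_guided_mask(x, patch_len=8, mask_ratio=0.125))  # only patch (1, 2) is masked
```

One plausible rationale for scoring by spectral energy is to concentrate the reconstruction task on information-rich segments rather than flat, easily interpolated ones. The paper's exact scoring and sampling details may differ.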