Multi-scale Masked Autoencoder for Electrocardiogram Anomaly Detection
Ya Zhou, Yujie Yang, Jianhuang Gan, Xiangjie Li, Jing Yuan, Wei Zhao
TL;DR
This work tackles the challenge of ECG anomaly detection without reliance on R-peak segmentation, introducing MMAE-ECG, a lightweight Transformer-based framework that uses multi-scale masking and distinct local/global positional embeddings to capture both global rhythms and local morphologies. The method jointly performs end-to-end anomaly detection and localization via multi-scale masking, cross-attention encoding, and a one-layer reconstruction decoder, with an aggregation strategy that yields robust sample- and point-level scores. On PTB-XL, MMAE-ECG achieves competitive detection and localization performance while dramatically reducing inference FLOPs (to approximately 1/78 of those required by the previous state of the art) and model size, demonstrating strong potential for practical, resource-constrained clinical deployment. Ablation studies validate the contribution of each component, highlighting the value of multi-scale representations and targeted loss design in ECG anomaly detection tasks and suggesting extensions to broader ECG analysis beyond anomaly detection.
Abstract
Electrocardiogram (ECG) analysis is a fundamental tool for diagnosing cardiovascular conditions, yet anomaly detection in ECG signals remains challenging due to their inherent complexity and variability. We propose Multi-scale Masked Autoencoder for ECG anomaly detection (MMAE-ECG), a novel end-to-end framework that effectively captures both global and local dependencies in ECG data. Unlike state-of-the-art methods that rely on heartbeat segmentation or R-peak detection, MMAE-ECG eliminates the need for such pre-processing steps, enhancing its suitability for clinical deployment. MMAE-ECG partitions ECG signals into non-overlapping segments, with each segment assigned learnable positional embeddings. A novel multi-scale masking strategy and multi-scale attention mechanism, along with distinct positional embeddings, enable a lightweight Transformer encoder to effectively capture both local and global dependencies. The masked segments are then reconstructed using a single-layer Transformer block, with an aggregation strategy employed during inference to refine the outputs. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art approaches while significantly reducing computational complexity, requiring approximately 1/78 of the floating-point operations (FLOPs) for inference. Ablation studies further validate the effectiveness of each component, highlighting the potential of multi-scale masked autoencoders for anomaly detection.
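The pipeline described above (non-overlapping segmentation, masking at several scales, reconstruction of the masked segments, and aggregation into point- and sample-level scores) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the segment length, the set of scales, the contiguous-run masking pattern, the mean-squared-error aggregation, and all function names are assumptions made for illustration, and the Transformer encoder and decoder are replaced by a trivial stand-in reconstructor (`mean_fill_reconstruct`).

```python
import numpy as np


def partition(signal, seg_len):
    """Split a (T, C) signal into non-overlapping (n_seg, seg_len, C) segments."""
    n_seg = signal.shape[0] // seg_len
    # Assumption: trailing samples that do not fill a whole segment are dropped.
    return signal[: n_seg * seg_len].reshape(n_seg, seg_len, signal.shape[1])


def multi_scale_masks(n_seg, scales):
    """Build boolean masks that hide contiguous runs of s segments per scale s.

    Assumed pattern: within each scale, every segment is masked exactly once.
    """
    masks = []
    for s in scales:
        for start in range(0, n_seg, s):
            mask = np.zeros(n_seg, dtype=bool)
            mask[start:start + s] = True
            masks.append(mask)
    return masks


def mean_fill_reconstruct(segs, mask):
    """Stand-in for the model: fill masked segments with the mean visible segment."""
    recon = segs.copy()
    recon[mask] = segs[~mask].mean(axis=0)
    return recon


def anomaly_scores(signal, seg_len, scales, reconstruct):
    """Average reconstruction error over all masks that hid each point."""
    segs = partition(signal, seg_len)
    n_seg = segs.shape[0]
    if max(scales) >= n_seg:
        raise ValueError("each scale must leave at least one visible segment")
    err_sum = np.zeros(segs.shape[:2])  # (n_seg, seg_len)
    count = np.zeros(n_seg)
    for mask in multi_scale_masks(n_seg, scales):
        recon = reconstruct(segs, mask)
        err = ((recon - segs) ** 2).mean(axis=-1)  # squared error averaged over leads
        err_sum[mask] += err[mask]  # only masked segments contribute
        count[mask] += 1
    point_scores = (err_sum / count[:, None]).reshape(-1)
    sample_score = float(point_scores.mean())
    return point_scores, sample_score


# Toy usage: a periodic single-lead signal with one injected spike.
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 100)[:, None]
x[450, 0] += 5.0
point_scores, sample_score = anomaly_scores(x, 100, (1, 2), mean_fill_reconstruct)
```

In this toy setup, the point-level scores localize the injected spike, and their mean serves as a crude sample-level score. A real model would replace `mean_fill_reconstruct` with the trained encoder and single-layer decoder.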
