NAP: Attention-Based Late Fusion for Automatic Sleep Staging
Alvise Dei Rossi, Julia van der Meer, Markus H. Schmidt, Claudio L. A. Bassetti, Luigi Fiorillo, Francesca Faraci
TL;DR
The paper tackles the heterogeneity of polysomnography data across modalities, channels, and acquisition protocols by introducing NAP, a late-fusion meta-model that aggregates predictions from frozen single-channel predictors using a tri-axial attention mechanism to capture temporal, spatial, and predictor-level dependencies. NAP integrates diverse inputs through dimension-adaptive training and modality-aware fusion to produce robust epoch-wise sleep-stage estimates, achieving state-of-the-art zero-shot generalization on several out-of-domain datasets. NAP outperforms both the averaging-based SOMNUS ensemble and individual predictors, with notable macro-F1 (MF1) improvements on challenging stages such as N1 and, in some cases, Wake, across multiple unseen cohorts. The approach is modular and extendable to other multimodal physiological tasks, enabling principled fusion of heterogeneous predictive streams beyond sleep staging.
Abstract
Polysomnography signals are highly heterogeneous, varying in modality composition (e.g., EEG, EOG, ECG), channel availability (e.g., frontal, occipital EEG), and acquisition protocols across datasets and clinical sites. Most existing models that process polysomnography data rely on a fixed subset of modalities or channels and therefore do not fully exploit its inherently multimodal nature. We address this limitation by introducing NAP (Neural Aggregator of Predictions), an attention-based model that learns to combine multiple prediction streams using a tri-axial attention mechanism that captures temporal, spatial, and predictor-level dependencies. NAP is trained to adapt to different input dimensions. By aggregating outputs from frozen, pretrained single-channel models, NAP consistently outperforms individual predictors and simple ensembles, achieving state-of-the-art zero-shot generalization across multiple datasets. While demonstrated in the context of automated sleep staging from polysomnography, the proposed approach could be extended to other multimodal physiological applications.
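As a rough illustration of what late fusion over frozen predictors means, the sketch below contrasts plain averaging (the kind of simple ensemble mentioned above) with a softmax-weighted combination along the predictor axis only. It is not NAP's architecture: the tri-axial attention is not reproduced, and the names `fuse`, `stage`, and the hand-set `scores` are hypothetical stand-ins for weights that a trained model would learn. The function accepts any number of prediction streams, which loosely mirrors the idea of adapting to different input dimensions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(streams, scores=None):
    """Fuse per-epoch class probabilities from several prediction streams.

    streams: list of S streams; each stream is a list of T epochs;
             each epoch is a list of C class probabilities.
    scores:  optional list of S relevance scores (hypothetical; set by
             hand here, whereas a learned model would produce them).
             None means uniform weights, i.e. plain averaging.
    """
    n_streams = len(streams)
    if scores is None:
        weights = [1.0 / n_streams] * n_streams
    else:
        weights = softmax(scores)
    n_epochs = len(streams[0])
    n_classes = len(streams[0][0])
    # Weighted sum over the stream axis, per epoch and per class.
    return [
        [sum(w * s[t][c] for w, s in zip(weights, streams))
         for c in range(n_classes)]
        for t in range(n_epochs)
    ]


def stage(fused):
    """Pick the most probable class index for each epoch."""
    return [max(range(len(p)), key=p.__getitem__) for p in fused]


# Toy example: two streams, one epoch, two classes.
streams = [[[0.6, 0.4]], [[0.2, 0.8]]]
averaged = fuse(streams)                      # uniform weights
weighted = fuse(streams, scores=[10.0, -10.0])  # first stream dominates
```

Because the weights always sum to one, each fused epoch remains a valid probability distribution. The same `fuse` call works unchanged if a third or fourth stream is appended to `streams`.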
