A Deep Generative Model for Five-Class Sleep Staging with Arbitrary Sensor Input

Hans van Gorp; Merel M. van Gilst; Pedro Fonseca; Fokke B. van Meulen; Johannes P. van Dijk; Sebastiaan Overeem; Ruud J. G. van Sloun

TL;DR

This work introduces a deep generative sleep-staging framework, Factorized Score-based Diffusion Modeling (FSDM), that supports automatic five-class sleep staging from arbitrary combinations of input signals. By factorizing the posterior score into a global prior and sensor-specific likelihood-minus-prior terms, and training per-sensor score networks, the method enables zero-shot inference on unseen sensor sets while remaining scalable to new modalities. The approach achieves human-interrater-level performance on standard EEG inputs (≈85% accuracy, κ≈0.79) and demonstrates competitive results with cardio-respiratory and unconventional signals, with a per-sensor information-gain metric strongly correlating with accuracy (r≈0.91). The framework offers a flexible, robust path toward universal sleep staging across diverse clinical and home monitoring scenarios, and it supports post-hoc addition of new sensors without retraining on all inputs.
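The factorization described above can be checked on a toy problem. The sketch below is not the paper's implementation: it assumes a scalar Gaussian prior and independent Gaussian sensor noise (all function names and parameters are hypothetical), a setting in which every score has a closed form. It shows that adding one "conditional score minus prior score" term per sensor to the prior score reproduces the exact joint posterior score. In this toy, the per-sensor prior coincides with the global prior; in FSDM the scores are learned by networks rather than written analytically.

```python
import numpy as np

def prior_score(y, s0):
    """Score (gradient of the log-density) of the Gaussian prior N(0, s0^2)."""
    return -y / s0**2

def single_sensor_posterior_score(y, x, s0, s):
    """Score of p(y | x) when x = y + Gaussian noise with std s (toy assumption)."""
    precision = 1.0 / s0**2 + 1.0 / s**2
    mean = (x / s**2) / precision
    return -(y - mean) * precision

def factorized_posterior_score(y, xs, s0, sensor_sigmas):
    """Prior score plus one (conditional - prior) correction per available sensor."""
    total = prior_score(y, s0)
    for x, s in zip(xs, sensor_sigmas):
        total = total + single_sensor_posterior_score(y, x, s0, s) - prior_score(y, s0)
    return total

def joint_posterior_score(y, xs, s0, sensor_sigmas):
    """Reference: exact score of p(y | x_1, ..., x_N) for the same toy model."""
    xs = np.asarray(xs, dtype=float)
    sig = np.asarray(sensor_sigmas, dtype=float)
    precision = 1.0 / s0**2 + np.sum(1.0 / sig**2)
    mean = np.sum(xs / sig**2) / precision
    return -(y - mean) * precision
```

Because each sensor contributes its own additive term, any subset of sensors can be combined at inference time by summing only the terms that are available, which is the property that permits zero-shot use of unseen sensor sets.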

Abstract

Gold-standard sleep scoring assigns a sleep stage to each epoch based on a combination of EEG, EOG and EMG signals. However, a polysomnographic recording contains many other signals that could be used for sleep staging, including cardio-respiratory modalities. Leveraging this signal variety would offer important advantages, for example increased reliability, resilience to signal loss, and application to long-term non-obtrusive recordings. We developed a deep generative model for automatic sleep staging from a plurality of sensors and any arbitrary combination thereof. We trained a score-based diffusion model on a dataset of 1947 expert-labelled overnight recordings with 36 different signals, and achieved zero-shot inference on any sensor set by leveraging a novel Bayesian factorization of the score function across the sensors. On single-channel EEG, the model reaches the performance limit set by polysomnography inter-rater agreement (5-class accuracy 85.6%, Cohen's kappa 0.791). Moreover, the method offers full flexibility to use any sensor set, for example finger photoplethysmography, nasal flow and thoracic respiratory movements (5-class accuracy 79.0%, Cohen's kappa 0.697), or even derivations highly unconventional for sleep staging, such as tibialis and sternocleidomastoid EMG (5-class accuracy 71.0%, Cohen's kappa 0.575). Additionally, we propose a novel interpretability metric in terms of information gain per sensor and show that it is linearly correlated with classification performance. Finally, our model allows for post-hoc addition of entirely new sensor modalities by merely training a score estimator on the novel input, instead of retraining from scratch on all inputs.
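For readers unfamiliar with the agreement statistic reported alongside accuracy, the following is a minimal sketch of Cohen's kappa for epoch-wise stage labels. It is an illustration rather than the authors' evaluation code; the function name and the integer coding of the five stages (0 to 4) are assumptions made here.

```python
import numpy as np

def cohens_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa between two label sequences coded as integers 0..n_classes-1."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    # Confusion matrix: rows are reference labels, columns are predicted labels.
    confusion = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(confusion, (y_true, y_pred), 1.0)
    total = confusion.sum()
    # Observed agreement is the fraction of epochs on the diagonal.
    observed = np.trace(confusion) / total
    # Chance agreement comes from the product of the two marginal distributions.
    expected = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / total**2
    # Undefined (division by zero) if both sequences contain a single identical class.
    return (observed - expected) / (1.0 - expected)
```

Kappa discounts the agreement that would occur by chance given the stage distribution, which is why it is lower than raw accuracy for the same predictions.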

Paper Structure (16 sections, 22 equations, 6 figures, 4 tables)

Introduction
Methods
Factorized score-based diffusion modeling
Factorized posterior score
Score-based diffusion modeling
Learning the individual conditional scores
Learning the prior scores
Sampling from an FSDM
Information
The SOMNIA and HealthBed datasets
Signal extraction
Neural network architecture
Metrics
Additional comparison on Sleep-EDF expanded
Results
...and 1 more sections

Figures (6)

Figure 1: Visualization of the sampling process for an FSDM model. (A) From a current point $\bm{y}_m$ we estimate two likelihoods, two priors, and one global prior. Combining them all leads to a denoised estimate outside the hypnodensity manifold, which is corrected using a projection step $\tau(\cdot)$. (B) Evolution of a sample over the last three time-steps. The end-estimate progressively moves from the hypnodensity manifold to the one-hot shell.
Figure 2: Qualitative examples of using five different signal combinations on a healthy subject (left column), and on a subject with narcolepsy type 1 (right column). The 5-class accuracy on each recording is listed between brackets. The red bars denote REM sleep.
Figure 3: Bland-Altman plots for the overnight sleep statistics as predicted by the recommended PSG setup over all recordings in the hold-out test set. The limits of agreement are given at the 95% confidence interval. A positive y value indicates an overestimation by our model with respect to the gold-standard, while a negative value indicates an underestimation.
Figure 4: Bland-Altman plots for four combinations of sleep statistics, disorders, and input signals. Limits of agreement are at the 95% confidence interval. From left to right: total sleep time for OSA (HSAT: Nasal Cannula + finger PPG + Thoracic Belt), WASO [min] for insomnia (finger PPG), REM onset latency [min] for narcolepsy (recommended PSG setup), and time in REM [min] for RBD (single channel EEG: F4-M1).
Figure 5: Average information gain, i.e. the average difference between likelihood and prior terms, versus 5-class accuracy. Left: Linear correlation between information gain per sensor and accuracy, with highlighted sensor positions. Right: Impact of reducing ECG signal quality (removing segments or adding noise) on accuracy and information gain, with text boxes showing the percentage of recording removed or SNR of added noise. The linear fit from the left plot still provides a good fit. SCM: sternocleidomastoid, IBR: instantaneous breathing rate, SSN: suprasternal notch.
...and 1 more figures
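
The Bland-Altman captions above refer to a bias and limits of agreement. A minimal sketch of the conventional computation follows, assuming the limits are the mean difference plus or minus 1.96 standard deviations; the exact procedure used for the figures is not stated here, and the function name is hypothetical.

```python
import numpy as np

def bland_altman(predicted, reference):
    """Return (bias, lower limit, upper limit) for paired overnight statistics."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Positive differences mean the model overestimates relative to the reference.
    diff = predicted - reference
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    # Assumed convention: 95% limits of agreement at bias +/- 1.96 * SD.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

For example, passing per-recording total sleep time in minutes from the model and from the manual scoring yields the horizontal lines typically drawn on such a plot.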
