Transformer-Based Sleep Stage Classification Enhanced by Clinical Information
Woosuk Chung, Seokwoo Hong, Wonhyeok Lee, Sangyoon Bae
TL;DR
The paper addresses two limitations of automated sleep stage classification from polysomnography (PSG): inter-scorer variability in the reference labels and the absence of the contextual cues that human scorers rely on. It introduces a two-stage architecture, a Transformer-based per-epoch encoder followed by a 1D CNN aggregator, with late fusion of subject-level clinical metadata and per-epoch expert event annotations. On the SHHS dataset (n=8,357), context fusion raises macro-F1 from 0.7745 to 0.8031 and micro-F1 from 0.8774 to 0.9051, with per-epoch events contributing the largest gains; a multi-task approach offers little additional benefit. The results indicate that integrating clinically meaningful cues with deep representations improves both accuracy and interpretability, supporting context-aware, expert-aligned sleep staging systems.
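The two reported metrics weight errors differently, which is one reason the macro-F1 values sit below the micro-F1 values. The following is a minimal sketch, not the authors' evaluation code: the function name `f1_scores` and the toy labels are hypothetical and are not SHHS data. For single-label multiclass staging, micro-F1 pools all epochs and equals accuracy, while macro-F1 averages per-stage F1 so that an infrequent stage counts as much as a frequent one.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted stage gains a false positive
            fn[t] += 1  # true stage gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of per-stage F1.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 computed from counts pooled over all stages.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels for illustration only (hypothetical, not from the paper).
y_true = ["W", "N2", "N2", "N2", "N1", "REM", "N2", "N3"]
y_pred = ["W", "N2", "N2", "N2", "N2", "REM", "N2", "N3"]
macro, micro = f1_scores(y_true, y_pred)
print(f"macro-F1={macro:.4f}  micro-F1={micro:.4f}")
```

In this toy case a single missed N1 epoch sets that stage's F1 to zero, which lowers macro-F1 far more than micro-F1.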
Abstract
Manual sleep staging from polysomnography (PSG) is labor-intensive and prone to inter-scorer variability. While recent deep learning models have advanced automated staging, most rely solely on raw PSG signals and neglect the contextual cues used by human experts. We propose a two-stage architecture that combines a Transformer-based per-epoch encoder with a 1D CNN aggregator, and we systematically investigate the effect of incorporating explicit context: subject-level clinical metadata (age, sex, BMI) and per-epoch expert event annotations (apneas, desaturations, arousals, periodic breathing). Using the Sleep Heart Health Study (SHHS) cohort (n=8,357), we demonstrate that contextual fusion substantially improves staging accuracy. Compared to a PSG-only baseline (macro-F1 0.7745, micro-F1 0.8774), our final model achieves macro-F1 0.8031 and micro-F1 0.9051, with event annotations contributing the largest gains. Notably, feature fusion outperforms multi-task alternatives that predict the same auxiliary labels. These results highlight that augmenting learned representations with clinically meaningful features enhances both performance and interpretability, without modifying the PSG montage or requiring additional sensors. Our findings support a practical and scalable path toward context-aware, expert-aligned sleep staging systems.
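The feature-fusion idea described above can be illustrated with a minimal sketch, assuming fusion is a simple concatenation of each epoch's learned embedding with the subject's metadata and that epoch's event flags. The abstract does not specify the exact fusion point or feature encoding, so the names `Subject`, `fuse_epoch`, and `fuse_night`, the 0/1 sex encoding, and the scaling constants are all hypothetical.

```python
from dataclasses import dataclass

# Event types named in the abstract; the ordering here is an assumption.
EVENT_TYPES = ("apnea", "desaturation", "arousal", "periodic_breathing")


@dataclass
class Subject:
    age: float  # years
    sex: int    # hypothetical 0/1 encoding
    bmi: float  # kg/m^2


def fuse_epoch(embedding, subject, events):
    """Concatenate one epoch embedding with metadata and event flags."""
    # Hypothetical scaling to keep metadata roughly in [0, 1].
    meta = [subject.age / 100.0, float(subject.sex), subject.bmi / 50.0]
    # One binary flag per event type annotated in this epoch.
    flags = [1.0 if name in events else 0.0 for name in EVENT_TYPES]
    return list(embedding) + meta + flags


def fuse_night(embeddings, subject, events_per_epoch):
    """Fuse every epoch of a recording; metadata repeats across epochs."""
    if len(embeddings) != len(events_per_epoch):
        raise ValueError("need one event set per epoch embedding")
    return [fuse_epoch(e, subject, ev)
            for e, ev in zip(embeddings, events_per_epoch)]


# Two toy epochs with 2-dimensional embeddings (illustrative values only).
fused = fuse_night(
    embeddings=[[0.1, 0.2], [0.3, 0.4]],
    subject=Subject(age=60, sex=1, bmi=25),
    events_per_epoch=[{"apnea"}, set()],
)
print(len(fused), len(fused[0]))
```

Under this assumption the fused vectors, rather than the raw embeddings, would be passed to the downstream classifier. Because the context enters as input features instead of auxiliary prediction targets, this is the feature-fusion variant rather than the multi-task one.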
