Transformer-Based Sleep Stage Classification Enhanced by Clinical Information

Woosuk Chung, Seokwoo Hong, Wonhyeok Lee, Sangyoon Bae

TL;DR

The paper tackles automated sleep stage classification from polysomnography, motivated by inter-scorer variability in manual staging and by the contextual cues that signal-only models ignore. It introduces a two-stage architecture: a Transformer-based per-epoch encoder and a 1D CNN aggregator, with late fusion of subject-level clinical metadata and per-epoch expert event annotations. On the SHHS dataset (n=8,357), context fusion raises macro-F1 from 0.7745 to 0.8031 and micro-F1 from 0.8774 to 0.9051, with per-epoch events contributing the largest gains; a multi-task approach offers little additional benefit. The work shows that integrating clinically meaningful cues with deep representations improves both accuracy and interpretability, pointing toward context-aware, expert-aligned sleep staging systems.

Abstract

Manual sleep staging from polysomnography (PSG) is labor-intensive and prone to inter-scorer variability. While recent deep learning models have advanced automated staging, most rely solely on raw PSG signals and neglect contextual cues used by human experts. We propose a two-stage architecture that combines a Transformer-based per-epoch encoder with a 1D CNN aggregator, and systematically investigate the effect of incorporating explicit context: subject-level clinical metadata (age, sex, BMI) and per-epoch expert event annotations (apneas, desaturations, arousals, periodic breathing). Using the Sleep Heart Health Study (SHHS) cohort (n=8,357), we demonstrate that contextual fusion substantially improves staging accuracy. Compared to a PSG-only baseline (macro-F1 0.7745, micro-F1 0.8774), our final model achieves macro-F1 0.8031 and micro-F1 0.9051, with event annotations contributing the largest gains. Notably, feature fusion outperforms multi-task alternatives that predict the same auxiliary labels. These results highlight that augmenting learned representations with clinically meaningful features enhances both performance and interpretability, without modifying the PSG montage or requiring additional sensors. Our findings support a practical and scalable path toward context-aware, expert-aligned sleep staging systems.
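
To make the fusion idea concrete, the following is a minimal sketch of how per-epoch embeddings might be combined with the two kinds of context described above. It is not the authors' implementation. The use of plain concatenation as the fusion operator, the multi-hot event encoding, the scaling constants, and all function names (`encode_events`, `encode_metadata`, `fuse_recording`) are assumptions made for illustration. The embeddings are stand-ins for the output of the Transformer encoder, and the fused vectors are what a downstream aggregator would consume.

```python
from typing import List, Sequence

# Event vocabulary taken from the annotation types named in the abstract.
# The ordering and the string labels are hypothetical.
EVENT_TYPES = ("apnea", "desaturation", "arousal", "periodic_breathing")


def encode_events(active_events: Sequence[str]) -> List[float]:
    """Multi-hot vector marking which event types are annotated in one epoch."""
    active = set(active_events)
    return [1.0 if name in active else 0.0 for name in EVENT_TYPES]


def encode_metadata(age_years: float, sex_is_male: bool, bmi: float) -> List[float]:
    """Subject-level features; the scaling constants are illustrative assumptions."""
    return [age_years / 100.0, 1.0 if sex_is_male else 0.0, bmi / 50.0]


def fuse_recording(
    embeddings: Sequence[Sequence[float]],
    metadata: Sequence[float],
    events_per_epoch: Sequence[Sequence[str]],
) -> List[List[float]]:
    """Concatenate [embedding | metadata | events] for every epoch.

    The metadata vector is constant per subject, so it is repeated for each
    epoch, while the event vector changes from epoch to epoch.
    """
    if len(embeddings) != len(events_per_epoch):
        raise ValueError("need exactly one event list per epoch embedding")
    return [
        list(emb) + list(metadata) + encode_events(events)
        for emb, events in zip(embeddings, events_per_epoch)
    ]


# Toy usage: two epochs with 2-dimensional placeholder embeddings.
toy_embeddings = [[0.1, -0.2], [0.3, 0.4]]
toy_metadata = encode_metadata(age_years=60, sex_is_male=True, bmi=25.0)
toy_events = [["apnea", "desaturation"], []]
fused = fuse_recording(toy_embeddings, toy_metadata, toy_events)
```

Each fused vector here has 2 + 3 + 4 = 9 entries. Because the context enters as input features rather than as prediction targets, this corresponds to the feature-fusion setting, as opposed to the multi-task alternative in which the same labels are predicted as auxiliary outputs.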

Paper Structure

This paper contains 27 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Confusion matrices for sleep staging performance with different contextual information. (a) Baseline model (PSG-only), (b) model with clinical metadata (+ Clinical), (c) model with per-epoch event annotations (+ Event), and (d) the final model integrating all inputs (+ Clinical & Event). The diagonal elements represent correctly classified epochs, visually demonstrating the improved accuracy as more information is added, particularly in model (d).
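
The macro-F1 and micro-F1 scores reported for these models can be derived directly from a confusion matrix. The sketch below shows one way to do so, assuming rows index the true stage and columns the predicted stage (the figure's actual orientation is not stated here). The function name `f1_from_confusion` and the toy matrix are hypothetical and do not reproduce any values from the paper.

```python
from typing import List, Tuple


def f1_from_confusion(cm: List[List[int]]) -> Tuple[float, float]:
    """Return (macro_f1, micro_f1) for a square confusion matrix.

    Assumes cm[i][j] counts epochs whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class_f1 = []
    correct = 0
    for k in range(n):
        tp = cm[k][k]                                # diagonal: correctly classified
        fn = sum(cm[k]) - tp                         # rest of row k
        fp = sum(cm[r][k] for r in range(n)) - tp    # rest of column k
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        correct += tp
    macro_f1 = sum(per_class_f1) / n                 # unweighted mean over classes
    # With exactly one label per epoch, micro-F1 equals overall accuracy.
    micro_f1 = correct / total if total else 0.0
    return macro_f1, micro_f1


# Toy two-class matrix with a rare second class (made-up counts).
toy_cm = [
    [90, 10],
    [5, 5],
]
macro, micro = f1_from_confusion(toy_cm)
print(f"macro-F1={macro:.4f}  micro-F1={micro:.4f}")
```

In the toy matrix the rare class is classified poorly, so macro-F1 (about 0.66) falls well below micro-F1 (about 0.86). Macro-F1 weights every class equally, while micro-F1 is dominated by the frequent classes, which is why the two scores are usually reported together for imbalanced problems such as sleep staging.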