A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG

Emilio Estevan, María Sierra-Torralba, Eduardo López-Larraz, Luis Montesano

TL;DR

This work addresses label scarcity in sleep staging from wearable EEG by systematically evaluating several self-supervised learning (SSL) methods under three evaluation scenarios, using the BOAS dataset (PSG and wearable EEG recordings with consensus labels) and the HOGAR dataset (unlabeled home recordings). It shows that SSL pre-training consistently improves downstream sleep-staging accuracy over fully supervised baselines, with the largest gains in low-label regimes, and that it reaches inter-scorer-level accuracy with only $5$–$10\%$ of the labels. Representations learned via SSL transfer well across datasets and recording conditions, indicating robustness to population and environment variability. The findings support SSL as a practical pathway toward label-efficient, scalable wearable sleep monitoring and motivate the development of EEG foundation models for broader neurophysiological tasks.

Abstract

Wearable EEG devices have emerged as a promising alternative to polysomnography (PSG). As affordable and scalable solutions, their widespread adoption results in the collection of massive volumes of unlabeled data that cannot be analyzed by clinicians at scale. Meanwhile, the recent success of deep learning for sleep scoring has relied on large annotated datasets. Self-supervised learning (SSL) offers an opportunity to bridge this gap, leveraging unlabeled signals to address label scarcity and reduce annotation effort. In this paper, we present the first systematic evaluation of SSL for sleep staging using wearable EEG. We investigate a range of well-established SSL methods and evaluate them on two sleep databases acquired with the Ikon Sleep wearable EEG headband: BOAS, a high-quality benchmark containing PSG and wearable EEG recordings with consensus labels, and HOGAR, a large collection of home-based, self-recorded, and unlabeled recordings. Three evaluation scenarios are defined to study label efficiency, representation quality, and cross-dataset generalization. Results show that SSL consistently improves classification performance by up to 10% over supervised baselines, with gains particularly evident when labeled data is scarce. SSL achieves clinical-grade accuracy above 80% leveraging only 5% to 10% of labeled data, while the supervised approach requires twice the labels. Additionally, SSL representations prove robust to variations in population characteristics, recording environments, and signal quality. Our findings demonstrate the potential of SSL to enable label-efficient sleep staging with wearable EEG, reducing reliance on manual annotations and advancing the development of affordable sleep monitoring systems.
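As a rough illustration of the label-efficiency setting described above, the sketch below draws reproducible subsets of a labeled pool at several label fractions. It is only a minimal sketch under stated assumptions, not the authors' protocol: the function name `sample_labeled_subset`, the recording IDs, the choice to subsample at the recording level, and the specific fractions are all hypothetical.

```python
import random


def sample_labeled_subset(recording_ids, fraction, seed=0):
    """Return a reproducible subset holding `fraction` of the labeled recordings.

    Assumption: labels are subsampled per recording; the paper may instead
    subsample per subject or per epoch.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)          # local RNG so the global state is untouched
    shuffled = list(recording_ids)     # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_keep = max(1, round(fraction * len(shuffled)))  # keep at least one recording
    return sorted(shuffled[:n_keep])


# Hypothetical pool of 100 labeled recordings.
recordings = [f"rec_{i:03d}" for i in range(100)]

# One subset per label fraction; each would be used to fine-tune or train a model.
subsets = {f: sample_labeled_subset(recordings, f) for f in (0.05, 0.10, 0.20, 1.0)}

for fraction, subset in subsets.items():
    print(f"{fraction:.0%} of labels -> {len(subset)} recordings")
```

Fixing the seed means a supervised baseline and an SSL-pre-trained model can be compared on exactly the same labeled subset at each fraction.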

Paper Structure

This paper contains 25 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the deep learning model architecture for automatic sleep staging. Architectural details are available in esparza2024automatic.
  • Figure 2: Illustration of the self-supervised learning strategies used for model pre-training on unlabeled EEG data.
  • Figure 3: UMAP visualization of feature representations from the first five BOAS recordings (Wake, N1, N2, N3, REM). The supervised baseline was trained using BOAS labels, while SSL methods were pre-trained only on unlabeled HOGAR data without supervised fine-tuning. BENDR and MAEEG were excluded, as they produce 2D convolutional outputs rather than 1D feature representation vectors.