Label-Efficient Sleep Staging Using Transformers Pre-trained with Position Prediction
Sayeri Lala, Hanlin Goh, Christopher Sandino
TL;DR
This work addresses the labeling burden in sleep staging by applying a Transformer with tightly integrated feature and temporal encoding and pretraining the entire model using a Masked Patch Position Prediction objective adapted for 1D EEG. The approach yields performance gains that persist across low- and high-data regimes, reducing the labeled data needed to reach the peak performance observed with extensive labeling by roughly 90% (about 800 subjects). Key findings show that gains grow with the size of the pretraining set, enabling high accuracy with substantially less labeled data, and that pretraining the whole encoder avoids the performance saturation seen in prior SSL sleep-staging studies. Practically, this SSL paradigm offers a scalable, data-efficient path for deploying sleep staging models across diverse populations and recording setups.
Abstract
Sleep staging is a clinically important task for diagnosing various sleep disorders, but remains challenging to deploy at scale because it is both labor-intensive and time-consuming. Supervised deep learning-based approaches can automate sleep staging, but they require large labeled datasets, which can be infeasible to procure in some settings, e.g., uncommon sleep disorders. While self-supervised learning (SSL) can mitigate this need, recent studies on SSL for sleep staging have shown that performance gains saturate after training with labeled data from only tens of subjects, and hence cannot match the peak performance attained with larger datasets. We hypothesize that the rapid saturation stems from a sub-optimal pretraining scheme that pretrains only a portion of the architecture, i.e., the feature encoder but not the temporal encoder. We therefore propose adopting an architecture that seamlessly couples feature and temporal encoding, together with a suitable pretraining scheme that pretrains the entire model. On a sample sleep staging dataset, we find that the proposed scheme offers performance gains that do not saturate with the amount of labeled training data (e.g., 3-5% improvement in balanced sleep staging accuracy across low- to high-labeled data settings), reducing the amount of labeled training data needed for high performance (e.g., by 800 subjects). Based on our findings, we recommend adopting this SSL paradigm for subsequent work on SSL for sleep staging.
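To make the position-prediction idea concrete, the sketch below builds one pretraining example from a 1D signal: the signal is cut into patches, the patches are shuffled, a subset is masked, and the target for each patch is its original index. This is an illustration only and not the authors' implementation. The helper name `make_position_prediction_example`, the patch length, the mask ratio, and the choice to mask by zeroing are all assumptions; this section does not specify them.

```python
import numpy as np


def make_position_prediction_example(signal, patch_len=100, mask_ratio=0.5, rng=None):
    """Build one position-prediction example from a 1D signal.

    Hypothetical helper: patch_len and mask_ratio are illustrative values,
    not settings taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_patches = len(signal) // patch_len

    # Drop trailing samples that do not fill a whole patch, then patchify.
    trimmed = np.asarray(signal[: n_patches * patch_len], dtype=np.float32)
    patches = trimmed.reshape(n_patches, patch_len)

    # Shuffle the patch order; the model would receive no positional encoding,
    # so it must infer each patch's position from content and context.
    order = rng.permutation(n_patches)
    shuffled = patches[order].copy()

    # Mask a random subset of patches. Zeroing is a simple stand-in
    # (assumption) for whatever masking mechanism the actual method uses.
    n_masked = int(round(mask_ratio * n_patches))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    shuffled[mask] = 0.0

    # Classification target: the original index of each shuffled patch.
    targets = order
    return shuffled, targets, mask


if __name__ == "__main__":
    # A synthetic 3000-sample signal stands in for one EEG epoch.
    epoch = np.random.default_rng(0).standard_normal(3000)
    x, y, m = make_position_prediction_example(epoch, rng=np.random.default_rng(1))
    print(x.shape, y.shape, int(m.sum()))
```

A full-model pretraining loop would feed `x` through both the feature and temporal encoding stages and train a per-patch classifier over the `n_patches` possible positions against `y`, so that no part of the encoder is left unpretrained.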
