Stanford Sleep Bench: Evaluating Polysomnography Pre-training Methods for Sleep Foundation Models

Magnus Ruud Kjaer, Rahul Thapa, Gauri Ganjoo, Hyatt Moore, Poul Joergen Jennum, Brandon M. Westover, James Zou, Emmanuel Mignot, Bryan He, Andreas Brink-Kjaer

TL;DR

Stanford Sleep Bench addresses two gaps in sleep foundation model research: the lack of a shared, comprehensive PSG benchmark and the absence of a systematic evaluation of SSRL methods. It introduces a large-scale, multimodal PSG dataset with 16 channels and 163,650 hours of recordings, plus 13 disease and mortality endpoints and canonical tasks such as sleep staging, apnea diagnosis, and age estimation. The study evaluates several SSRL pre-training schemes (contrastive learning, denoising, and reconstruction) across these tasks. Contrastive learning performs best on disease and mortality prediction, while multiple methods, including reconstruction-based ones, perform comparably on the canonical tasks. The authors plan to release the dataset, pretrained weights, and evaluation code to support reproducibility and progress in sleep AI.

Abstract

Polysomnography (PSG), the gold standard test for sleep analysis, generates vast amounts of multimodal clinical data, presenting an opportunity to leverage self-supervised representation learning (SSRL) for pre-training foundation models to enhance sleep analysis. However, progress in sleep foundation models is hindered by two key limitations: (1) the lack of a shared dataset and benchmark with diverse tasks for training and evaluation, and (2) the absence of a systematic evaluation of SSRL approaches across sleep-related tasks. To address these gaps, we introduce Stanford Sleep Bench, a large-scale PSG dataset comprising 17,467 recordings totaling over 163,000 hours from a major sleep clinic, including 13 clinical disease prediction tasks alongside canonical sleep-related tasks such as sleep staging, apnea diagnosis, and age estimation. We systematically evaluate SSRL pre-training methods on Stanford Sleep Bench, assessing downstream performance across four tasks: sleep staging, apnea diagnosis, age estimation, and disease and mortality prediction. Our results show that multiple pretraining methods achieve comparable performance for sleep staging, apnea diagnosis, and age estimation. However, for mortality and disease prediction, contrastive learning significantly outperforms other approaches while also converging faster during pretraining. To facilitate reproducibility and advance sleep research, we will release Stanford Sleep Bench along with pretrained model weights, training pipelines, and evaluation code.

Paper Structure

This paper contains 18 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of Stanford Sleep Bench and the training pipeline for multiple SSRL methods. Stanford Sleep Bench includes four signal modalities: brain activity (BAS), respiration (RESP), electrocardiogram (EKG), and electromyography (EMG). Each modality is encoded independently using a CNN-based encoder, followed by a temporal transformer. SSRL methods—contrastive learning, signal reconstruction, and denoising—are trained separately, with no shared parameters. After pre-training, an LSTM-based prediction head with attention pooling is added for downstream tasks. Note that while all SSRL methods are shown, they are trained as separate models.
  • Figure 2: Comparison of self-supervised representation learning methods across sleep staging, apnea diagnosis, age estimation, and overall disease prediction, including mortality.
  • Figure 3: Few-shot performance of self-supervised representation learning methods on three replicates of [1, 8, 64, 256, 512, 1024] subjects across sleep staging, apnea diagnosis, and age estimation.
  • Figure 4: Swimmer plot of all 12 clinical conditions in Stanford Sleep Bench on the test set, illustrating the timelines of positive cases from the date of PSG recording to the first occurrence of each condition.
  • Figure 5: Swimmer plot of all-cause mortality in Stanford Sleep Bench on the test set, showing the timeline from the date of PSG recording to the occurrence of death.
  • ...and 3 more figures
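
The few-shot protocol in the Figure 3 caption (three replicates at each of 1, 8, 64, 256, 512, and 1024 subjects) could be set up roughly as in the sketch below. This is an illustration only, not the authors' code: the function name `sample_few_shot_subsets`, the subject identifiers, and the choice to draw each subset independently and without replacement are all assumptions, since the caption does not say how subsets were sampled.

```python
import random

# Subject counts and replicate count taken from the Figure 3 caption.
SUBSET_SIZES = [1, 8, 64, 256, 512, 1024]
NUM_REPLICATES = 3


def sample_few_shot_subsets(subject_ids, sizes=SUBSET_SIZES,
                            num_replicates=NUM_REPLICATES, seed=0):
    """Return {(replicate, size): [subject ids]} for few-shot fine-tuning.

    Assumption: each subset is drawn independently and without replacement
    from the training subjects. The paper may sample differently (for
    example, with nested subsets).
    """
    subject_ids = sorted(subject_ids)  # fixed order so a seed is reproducible
    if max(sizes) > len(subject_ids):
        raise ValueError("largest subset exceeds the number of subjects")
    rng = random.Random(seed)
    subsets = {}
    for replicate in range(num_replicates):
        for size in sizes:
            subsets[(replicate, size)] = rng.sample(subject_ids, size)
    return subsets


if __name__ == "__main__":
    # Hypothetical subject identifiers, for illustration only.
    train_subjects = [f"subject_{i:05d}" for i in range(5000)]
    subsets = sample_few_shot_subsets(train_subjects)
    for (replicate, size), ids in sorted(subsets.items()):
        print(f"replicate={replicate} size={size} first={ids[0]}")
```

Sampling at the subject level rather than the recording level keeps all data from one person inside a single subset. Each `(replicate, size)` entry would then be used to fine-tune the prediction head on top of each pre-trained encoder.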