sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals

Weixuan Yuan, Zengrui Jin, Yichen Wang, Donglin Xie, Ziyi Ye, Chao Zhang, Xuesong Chen

TL;DR

Results show that unified cross-modal alignment, coupled with principled scaling, enables label-efficient, general-purpose modelling of real-world nocturnal biosignals.

Abstract

Tasks ranging from sleep staging to clinical diagnosis traditionally rely on standard polysomnography (PSG) devices, bedside monitors and wearable devices, which capture diverse nocturnal biosignals (e.g., EEG, EOG, ECG, SpO$_2$). However, heterogeneity across devices and frequent sensor dropout pose significant challenges for unified modelling of these multimodal signals. We present \texttt{sleep2vec}, a foundation model for diverse and incomplete nocturnal biosignals that learns a shared representation via cross-modal alignment. \texttt{sleep2vec} is contrastively pre-trained on 42,249 overnight recordings spanning nine modalities using a \textit{Demography, Age, Site \& History-aware InfoNCE} objective that incorporates physiological and acquisition metadata (\textit{e.g.}, age, gender, recording site) to dynamically weight negatives and mitigate cohort-specific shortcuts. On downstream sleep staging and clinical outcome assessment, \texttt{sleep2vec} consistently outperforms strong baselines and remains robust to any subset of available modalities and sensor dropout. We further characterize, to our knowledge for the first time, scaling laws for nocturnal biosignals with respect to modality diversity and model capacity. Together, these results show that unified cross-modal alignment, coupled with principled scaling, enables label-efficient, general-purpose modelling of real-world nocturnal biosignals.
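The abstract describes an InfoNCE objective whose negatives are weighted using metadata such as age and recording site, but it does not give the weighting rule. The following is a minimal sketch of the general idea only, not the authors' formulation: `weighted_info_nce` is a one-directional InfoNCE loss with a per-pair weight on each negative, and `metadata_weights` is a purely hypothetical rule (with made-up parameters `age_scale` and `same_site_boost`) that makes negatives from a similar cohort count more, so that cohort identity alone is less useful for telling pairs apart.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def weighted_info_nce(z_a, z_b, neg_weights, temperature=0.1):
    """One-directional (a -> b) InfoNCE with weighted negatives.

    z_a, z_b: lists of B embeddings; row i of each comes from the same segment
    seen through two different modalities (the positive pair).
    neg_weights: B x B non-negative weights; the diagonal is ignored because
    positives always receive weight 1.
    """
    a = [_normalize(v) for v in z_a]
    b = [_normalize(v) for v in z_b]
    losses = []
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        m = max(logits)  # subtract the row maximum for numerical stability
        denom = 0.0
        for j, logit in enumerate(logits):
            w = 1.0 if j == i else neg_weights[i][j]
            denom += w * math.exp(logit - m)
        pos = math.exp(logits[i] - m)
        losses.append(-math.log(pos / denom))
    return sum(losses) / len(losses)


def metadata_weights(ages, sites, age_scale=10.0, same_site_boost=2.0):
    """Hypothetical weighting rule (an assumption, not taken from the paper):
    negatives that are close in age and share a recording site weigh more."""
    n = len(ages)
    return [
        [
            math.exp(-abs(ages[i] - ages[j]) / age_scale)
            * (same_site_boost if sites[i] == sites[j] else 1.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With all weights set to 1 the function reduces to standard InfoNCE; raising the weight of a negative increases its share of the denominator and therefore the penalty for embedding it close to the anchor.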

Paper Structure (44 sections, 7 equations, 10 figures, 16 tables)

Figures (10)

  • Figure 1: Polysomnography (PSG) captures diverse physiological signals, illustrated here as 30-second segments for each modality. High-sampling-rate electrophysiological channels include EEG, EMG, EOG, and ECG, while lower-sampling-rate cardiopulmonary and oximetry channels encompass Nasal Airflow, Abdominal/Thoracic Belt (ABD/Thor Belt), and SpO$_2$. Inter-Beat Interval (IBI) and Respiratory effort (RESP) are interval-derived features computed from ECG and respiratory channels (ABD/Thor Belts when available, otherwise Nasal Airflow), respectively. Together, these concurrent nocturnal signals provide complementary perspectives on a shared latent physiological state, highlighting the multimodal complexity inherent to sleep monitoring.
  • Figure 2: An illustration of the multimodal pre-training framework. Each overnight PSG recording is partitioned into intra-subject segments (different temporal slices from the same individual) and inter-subject segments (slices from different individuals), which are independently tokenized via modality-specific MLP tokenizers. A learnable [CLS] token is prepended to each masked sequence before processing through a modality-agnostic RoFormer backbone. Hidden states from the backbone at each timestep are projected into a shared alignment space, enabling timestep-wise pairwise contrastive alignment across modalities.
  • Figure 3: t-SNE visualization of encoder embeddings comparing random initialization and post-pre-training results. Left Panel (Subject–Modality Alignment): Visualization of [CLS] token embeddings shows that pre-training effectively clusters embeddings from different modalities into distinct, subject-specific groups, indicating aligned subject-level physiological states. Right Panel (Time–Modality Alignment): Visualization of timestep-level embeddings, where dot size indicates temporal ordering (larger $\rightarrow$ later). Pre-trained embeddings form structured trajectories, in contrast to the scattered distribution observed before training.
  • Figure 4: Leave-one-out analysis on the SHHS sleep staging task. Each bar represents model accuracy when one of the nine modalities is excluded during both pre-training and fine-tuning. The drop in accuracy relative to the full-channel baseline (labeled "None") reflects the contribution of each individual modality to overall model performance.
  • Figure 5: ROC-AUC scores for disease prediction tasks using varying numbers of modalities ($N$) on the SHHS dataset. Results are averaged across all possible modality combinations of size $N$.
  • ...and 5 more figures
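
The averaging described in the Figure 5 caption (one score per modality subset of size $N$, then a mean over all subsets of that size) can be sketched as follows. This is an illustrative outline under stated assumptions: `score_fn` is a hypothetical callback standing in for whatever evaluation produces a ROC-AUC for a given subset, and `toy_score` returns made-up numbers purely to show the mechanics.

```python
from itertools import combinations
from statistics import mean


def mean_score_by_subset_size(modalities, score_fn):
    """Average score_fn over every modality subset of each size N.

    modalities: list of modality names.
    score_fn: hypothetical callback mapping a frozenset of modality names to a
    score (for example, a ROC-AUC measured with only those inputs available).
    Returns a dict {N: mean score over all subsets of size N}.
    """
    results = {}
    for n in range(1, len(modalities) + 1):
        scores = [score_fn(frozenset(c)) for c in combinations(modalities, n)]
        results[n] = mean(scores)
    return results


def toy_score(subset):
    """Made-up scoring rule for demonstration only; not a result from the paper."""
    return 0.9 if "EEG" in subset else 0.6


if __name__ == "__main__":
    for n, avg in mean_score_by_subset_size(["EEG", "ECG", "SpO2"], toy_score).items():
        print(n, round(avg, 3))
```

With nine modalities the number of subsets of size $N$ is $\binom{9}{N}$, so each plotted point would average over a different number of combinations (for example, 9 for $N=1$ and 126 for $N=4$).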