SleepLM: Natural-Language Intelligence for Human Sleep

Zongzhe Xu, Zitao Shuai, Eideen Mozaffari, Ravi S. Aysola, Rajesh Kumar, Yuzhe Yang

TL;DR

SleepLM is a family of sleep-language foundation models that enable human sleep alignment, interpretation, and interaction with natural language, and that exhibit capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks.

Abstract

We present SleepLM, a family of sleep-language foundation models that enable human sleep alignment, interpretation, and interaction with natural language. Despite the critical role of sleep, learning-based sleep analysis systems operate in closed label spaces (e.g., predefined stages or events) and fail to describe, query, or generalize to novel sleep phenomena. SleepLM bridges natural language and multimodal polysomnography, enabling language-grounded representations of sleep physiology. To support this alignment, we introduce a multilevel sleep caption generation pipeline that enables the curation of the first large-scale sleep-text dataset, comprising over 100K hours of data from more than 10,000 individuals. Furthermore, we present a unified pretraining objective that combines contrastive alignment, caption generation, and signal reconstruction to better capture physiological fidelity and cross-modal interactions. Extensive experiments on real-world sleep understanding tasks verify that SleepLM outperforms state-of-the-art approaches in zero-shot and few-shot learning, cross-modal retrieval, and sleep captioning. Importantly, SleepLM also exhibits intriguing capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks. All code and data will be open-sourced.
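
As a rough illustration of how a unified objective of this kind could be assembled, the sketch below sums a contrastive term, a captioning term, and a reconstruction term. It is not the authors' implementation: the function names, the symmetric InfoNCE form, the mean-squared-error reconstruction term, the temperature, and the weights `w_con`, `w_cap`, and `w_rec` are all assumptions made for illustration.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between signal i and caption j;
    matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(transposed))


def caption_nll(token_probs):
    """Mean negative log-likelihood of the reference caption tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def reconstruction_mse(signal, reconstruction):
    """Mean squared error between a signal and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(signal, reconstruction)) / len(signal)


def combined_loss(sim, token_probs, signal, reconstruction,
                  w_con=1.0, w_cap=1.0, w_rec=1.0):
    """Weighted sum of the three terms (the weights are hypothetical)."""
    return (w_con * info_nce(sim)
            + w_cap * caption_nll(token_probs)
            + w_rec * reconstruction_mse(signal, reconstruction))


# Toy inputs: two signal-caption pairs, three caption tokens, four samples.
sim = [[0.9, 0.1], [0.2, 0.8]]
token_probs = [0.7, 0.5, 0.9]
signal = [0.0, 1.0, 0.5, -0.5]
reconstruction = [0.1, 0.9, 0.4, -0.4]
print(combined_loss(sim, token_probs, signal, reconstruction))
```

In this sketch, setting a weight to zero removes the corresponding term, which is one simple way to switch between a contrastive-only, captioning-only, or combined setup.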

Paper Structure (34 sections, 3 equations, 11 figures, 16 tables)

Figures (11)

  • Figure 1: Sleep-language foundation models (SleepLM). We present a comprehensive study using over 100K hours of multimodal sleep PSG data from over 10,000 individuals. We design a multi-level captioning pipeline that captures PSG information at different temporal and semantic granularities. Across a wide range of downstream tasks and evaluation settings, SleepLM consistently outperforms state-of-the-art LLMs and VLMs. In addition to its predictive capabilities, SleepLM also enables: (A) targeted and controlled insight generation, (B) language-guided event localization, and (C) within- and cross-modal zero-shot retrieval (details in Sec. \ref{sec:main}).
  • Figure 2: Multilevel sleep captioning pipeline. We generate three complementary levels of captions from each PSG window: (1) Channel captions summarize modality-specific, clinically relevant statistical features commonly used in manual scoring; (2) Local captions capture temporal semantics such as transient morphological changes and sleep event onsets and durations; (3) Global captions describe high-level physiological states such as sleep stage and overall cardiac and respiratory conditions. Example captions are provided in Appendix \ref{appendix:caption_examples}.
  • Figure 3: The SleepLM architecture, pretraining objectives, and variants. We introduce ReCoCa, a generic sleep-language pretraining framework that jointly optimizes signal reconstruction, contrastive alignment, and caption generation for multi-channel PSG. By enabling or disabling components, ReCoCa yields common formulations (e.g., CLIP, Cap, CoCa) as special cases.
  • Figure 3: Zero-shot generalization to unseen concepts. We report performance on two held-out respiratory event classification tasks, where SleepLM remains robust across both settings.
  • Figure 4: Zero-shot generalization analysis of SleepLM. We visualize text and signal embeddings with UMAP as a case study of zero-shot concept transfer. SleepLM is capable of clustering previously unseen concepts to semantically related seen concepts.
  • ...and 6 more figures
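
The zero-shot classification and retrieval settings mentioned in the captions above are commonly implemented by comparing a signal embedding against text-prompt embeddings in a shared space. The sketch below shows that general pattern only. The embedding vectors are hand-written stand-ins for encoder outputs, and the function names and prompt labels are hypothetical rather than taken from SleepLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(signal_embedding, prompt_embeddings):
    """Return the label whose prompt embedding is closest to the signal."""
    return max(
        prompt_embeddings,
        key=lambda label: cosine(signal_embedding, prompt_embeddings[label]),
    )


# Toy vectors standing in for text-encoder outputs of two class prompts.
prompts = {
    "obstructive apnea": [0.9, 0.1, 0.0],
    "normal breathing": [0.0, 0.2, 0.9],
}
# Toy vector standing in for the signal-encoder output of one PSG window.
window = [0.8, 0.3, 0.1]

print(zero_shot_classify(window, prompts))
```

Because the label set is just a dictionary of prompts, new classes can be added at inference time without retraining, which is the property that makes this pattern suitable for unseen tasks.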