SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers

Jonathan F. Carter, João Jorge, Oliver Gibson, Lionel Tarassenko

TL;DR

SleepVST presents a transformer-based framework for sleep staging that is first pre-trained on cardio-respiratory waveforms from contact sensors and then transferred to near-infrared video signals. The model achieves state-of-the-art cardio-respiratory staging on SHHS and MESA ($\kappa_T$ ≈ 0.75–0.77) and, when augmented with motion features, sets a new benchmark for four-class video-based sleep staging on the OSV dataset (Acc ≈ 78.8%, $\kappa_T$ ≈ 0.717). By leveraging a frozen SleepVST as a feature extractor and a Random Forest classifier for video data, the approach substantially narrows the gap between contact-based PSG and contact-free camera monitoring, highlighting the promise of camera-based sleep care. The work also emphasizes the value of motion-informed features and transfer learning for robust, non-contact sleep assessment across diverse sensing modalities.

Abstract

Advances in camera-based physiological monitoring have enabled the robust, non-contact measurement of respiration and the cardiac pulse, which are known to be indicative of the sleep stage. This has led to research into camera-based sleep monitoring as a promising alternative to "gold-standard" polysomnography, which is cumbersome, expensive to administer, and hence unsuitable for longer-term clinical studies. In this paper, we introduce SleepVST, a transformer model which enables state-of-the-art performance in camera-based sleep stage classification (sleep staging). After pre-training on contact sensor data, SleepVST outperforms existing methods for cardio-respiratory sleep staging on the SHHS and MESA datasets, achieving total Cohen's kappa scores of 0.75 and 0.77 respectively. We then show that SleepVST can be successfully transferred to cardio-respiratory waveforms extracted from video, enabling fully contact-free sleep staging. Using a video dataset of 50 nights, we achieve a total accuracy of 78.8% and a Cohen's $\kappa$ of 0.71 in four-class video-based sleep staging, setting a new state-of-the-art in the domain.
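
The headline metric above is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below illustrates the calculation on toy labels. It assumes that a "total" kappa is obtained by pooling the 30-second epochs from all nights before computing a single score; that reading, the function names `cohens_kappa` and `total_kappa`, and the example labels are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa between two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of epochs where both labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of the product of marginal frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


def total_kappa(nights):
    """Pool epochs from all nights (assumed meaning of 'total'), then score once.

    `nights` is a list of (y_true, y_pred) pairs, one pair per recording.
    """
    all_true = [t for y_true, _ in nights for t in y_true]
    all_pred = [p for _, y_pred in nights for p in y_pred]
    return cohens_kappa(all_true, all_pred)


# Toy example with two short "nights"; W = wake, L = light sleep (made-up data).
toy_nights = [
    (["W", "W"], ["W", "L"]),
    (["L", "L"], ["L", "L"]),
]
print(total_kappa(toy_nights))  # 0.5
```

In the toy example, three of four epochs agree ($p_o = 0.75$) and the marginals give $p_e = 0.5$, so $\kappa = 0.5$ even though raw accuracy is 75%.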

Paper Structure

This paper contains 22 sections, 7 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Sleep staging from near-infrared video signals using SleepVST. (a) We first pre-train the model end-to-end on cardiac (heart) and respiratory (breathing) waveforms $x_{HW}(t)$ and $x_{BW}(t)$ derived from the electrocardiogram (ECG) and a thoracic respiratory belt (THX) respectively. (b) After pre-training, we use the model as a frozen feature extractor, applying it to cardio-respiratory waveforms derived from near-infrared (NIR) video to generate sequences of features. When transferring to video data, we additionally use a set of motion features $f_i(t)$ derived from an optical flow field $\bm{u}(x,y,t)$ as inputs to the classifier C. This approach allows us to utilise much larger contact-sensor datasets to train SleepVST, whilst also enabling the incorporation of motion information when transferring to video data. (c) We evaluate SleepVST using overnight video polysomnography (vPSG) studies, comparing expert-labelled sleep stage sequences $\hat{y}_{1:T}$ against those generated entirely from near-infrared video $y_{1:T}$ using our method.
  • Figure 2: Example (normalized) cardiac and respiratory waveforms, $x_{HW}(t)$ and $x_{BW}(t)$, derived from contact sensors (blue) and video (orange) from the OSV dataset.
  • Figure 3: SleepVST architecture. Each 30-second window of heart ($x_{HW}$) and breathing waveforms ($x_{BW}$) is passed to a patch encoder, which turns them into patch-level features. These features are concatenated and passed to a transformer encoder. During pre-training, a linear layer turns the output feature sequence $z_o$ of length N from SleepVST into sleep stage classifications.
  • Figure 4: Waveform encoder design. Using a series of convolutional layers, the encoder turns sequences of signal patches into sequences of lower-dimensional feature vectors.
  • Figure 5: Example processing of an NIR video frame from the OSV dataset. (a) Real viewpoint. (b) Virtual viewpoint. (c) Head (H), body (B), and outer (O) bed regions. The distance from the camera to the head region is around 1.5 m.
  • ...and 11 more figures
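
The Figure 3 caption describes each 30-second window of the heart and breathing waveforms being turned into patches before encoding. The windowing and patching step can be sketched as below. The sampling rate (4 Hz), the patch length (24 samples), and the helper names `split_into_windows` and `split_into_patches` are hypothetical values chosen for illustration; the convolutional patch encoder and the transformer are not modelled here.

```python
def split_into_windows(signal, fs_hz, window_s=30):
    """Split a 1-D waveform into non-overlapping windows of `window_s` seconds.

    Any trailing samples that do not fill a whole window are dropped.
    """
    size = int(fs_hz * window_s)
    if size <= 0:
        raise ValueError("window size must be positive")
    n_windows = len(signal) // size
    return [signal[i * size:(i + 1) * size] for i in range(n_windows)]


def split_into_patches(window, patch_len):
    """Split one window into consecutive, equal-length patches."""
    if patch_len <= 0 or len(window) % patch_len != 0:
        raise ValueError("window length must be a positive multiple of patch_len")
    return [window[i:i + patch_len] for i in range(0, len(window), patch_len)]


# Hypothetical settings: 4 Hz waveform, 24-sample (6-second) patches.
FS_HZ = 4
PATCH_LEN = 24

# 65 seconds of placeholder samples standing in for x_HW(t) or x_BW(t).
waveform = list(range(65 * FS_HZ))

windows = split_into_windows(waveform, FS_HZ)
patches = split_into_patches(windows[0], PATCH_LEN)
print(len(windows), len(windows[0]), len(patches))  # 2 120 5
```

Under these assumed settings each 30-second window holds 120 samples and yields 5 patches per waveform; the final 5 seconds of the 65-second input do not fill a window and are discarded.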