SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers
Jonathan F. Carter, João Jorge, Oliver Gibson, Lionel Tarassenko
TL;DR
SleepVST presents a transformer-based framework for sleep staging that is first pre-trained on cardio-respiratory waveforms from contact sensors and then transferred to near-infrared video signals. The model achieves state-of-the-art cardio-respiratory staging on SHHS and MESA ($\kappa_T$ ≈ 0.75–0.77) and, when augmented with motion features, sets a new benchmark for four-class video-based sleep staging on the OSV dataset (Acc ≈ 78.8%, $\kappa_T$ ≈ 0.717). By leveraging a frozen SleepVST as a feature extractor and a Random Forest classifier for video data, the approach substantially narrows the gap between contact-based PSG and contact-free camera monitoring, highlighting the promise of camera-based sleep care. The work also emphasizes the value of motion-informed features and transfer learning for robust, non-contact sleep assessment across diverse sensing modalities.
Abstract
Advances in camera-based physiological monitoring have enabled the robust, non-contact measurement of respiration and the cardiac pulse, which are known to be indicative of the sleep stage. This has led to research into camera-based sleep monitoring as a promising alternative to "gold-standard" polysomnography, which is cumbersome, expensive to administer, and hence unsuitable for longer-term clinical studies. In this paper, we introduce SleepVST, a transformer model which enables state-of-the-art performance in camera-based sleep stage classification (sleep staging). After pre-training on contact sensor data, SleepVST outperforms existing methods for cardio-respiratory sleep staging on the SHHS and MESA datasets, achieving total Cohen's kappa scores of 0.75 and 0.77 respectively. We then show that SleepVST can be successfully transferred to cardio-respiratory waveforms extracted from video, enabling fully contact-free sleep staging. Using a video dataset of 50 nights, we achieve a total accuracy of 78.8% and a Cohen's $\kappa$ of 0.71 in four-class video-based sleep staging, setting a new state-of-the-art in the domain.
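The two headline metrics, total accuracy and Cohen's kappa, can be illustrated with a minimal sketch. It assumes that "total" means the metric is computed over all epochs pooled together, and that the four classes are Wake, Light, Deep and REM; both are assumptions not spelled out in the text above. The function name `cohens_kappa` and the toy labels are hypothetical and are not results from the paper.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Return (accuracy, kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement p_o: fraction of epochs where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement p_e: product of the marginal label frequencies,
    # summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    # kappa = (p_o - p_e) / (1 - p_e)
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy hypnogram (8 epochs), purely for illustration.
reference = ["Wake", "Light", "Light", "Deep", "REM", "Light", "Wake", "REM"]
predicted = ["Wake", "Light", "Deep", "Deep", "REM", "Light", "Light", "REM"]

accuracy, kappa = cohens_kappa(reference, predicted)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```

Kappa discounts agreement that would be expected by chance from the class frequencies alone, which is why it is reported alongside accuracy for sleep staging, where light sleep typically dominates the night.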
