EEG-VLM: A Hierarchical Vision-Language Model with Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction
Xihe Qiu, Gengchen Ma, Haoyu Wang, Chen Zhan, Xiaoyu Tan, Shuo Li
TL;DR
This paper tackles EEG-based sleep stage classification by extending vision-language models with a visual enhancement module, multi-level feature alignment, and stage-wise Chain-of-Thought reasoning. The EEG-VLM architecture combines a specialized visual pathway with CLIP-based features, producing hierarchical tokens that are fused and fed into a language model, guided by stage-specific CoT prompts to improve interpretability. Across Sleep-EDFx and external hospital data, EEG-VLM demonstrates substantial gains in accuracy, MF1, and kappa, with particular robustness on ambiguous stages like Wake, N1, and REM, and shows better cross-dataset generalization than purely visual baselines. The work provides a practical, explainable multimodal framework for clinical EEG analysis and suggests directions for integrating additional physiological signals in VLMs.
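For readers unfamiliar with the three metrics named above, the following is a minimal sketch of how accuracy, macro-averaged F1 (MF1), and Cohen's kappa are conventionally computed from per-epoch labels. The function names and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter

# Five-stage labelling commonly used for sleep staging (assumed here for illustration).
STAGES = ["W", "N1", "N2", "N3", "REM"]


def accuracy(y_true, y_pred):
    """Fraction of epochs whose predicted stage matches the reference stage."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (MF1)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that never appears in either list is scored 0.0 (a convention).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred, labels):
    """Agreement between prediction and reference, corrected for chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        return 0.0  # kappa is undefined here; returning 0.0 is a convention
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy labels for illustration only; these are not results from the paper.
y_true = ["W", "W", "N1", "N2", "N2", "N3", "REM"]
y_pred = ["W", "N1", "N1", "N2", "N2", "N3", "REM"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred, STAGES))
print(cohen_kappa(y_true, y_pred, STAGES))
```

MF1 weights every stage equally, which is why it is usually reported alongside accuracy for sleep staging, where minority stages such as N1 are easily masked by the majority classes.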
Abstract
Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.
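To make the idea of combining low-level and high-level visual tokens concrete, here is a minimal, dependency-free sketch. It assumes that each token set is linearly projected to a shared width and that the two sets are then concatenated along the sequence axis before being passed to the language model. The abstract does not specify the alignment mechanism, so the projection-plus-concatenation scheme, all dimensions, and every function name below are hypothetical.

```python
import random


def random_matrix(rows, cols, rng):
    """Small random weight matrix with shape [rows][cols] (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(tokens, weight):
    """Project each token (a list of floats) with a weight matrix of shape [d_in][d_out]."""
    columns = list(zip(*weight))  # one tuple of length d_in per output dimension
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


def fuse_multilevel_tokens(low_tokens, high_tokens, w_low, w_high):
    """Map both token sets to the same width, then concatenate along the sequence axis.

    This is an assumed fusion scheme for illustration, not the paper's mechanism.
    """
    return linear(low_tokens, w_low) + linear(high_tokens, w_high)


rng = random.Random(0)

# Hypothetical sizes: 4 low-level tokens of width 8, 2 high-level tokens of width 6.
low_tokens = [[rng.random() for _ in range(8)] for _ in range(4)]
high_tokens = [[rng.random() for _ in range(6)] for _ in range(2)]

d_model = 5  # hypothetical language-model embedding width
w_low = random_matrix(8, d_model, rng)
w_high = random_matrix(6, d_model, rng)

fused = fuse_multilevel_tokens(low_tokens, high_tokens, w_low, w_high)
print(len(fused), len(fused[0]))  # number of fused tokens, width of each token
```

Keeping a separate projection per feature level lets each level have its own native width while the language model sees one uniform token sequence. In a trained system the random matrices would be replaced by learned parameters.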
