VSViG: Real-time Video-based Seizure Detection via Skeleton-based Spatiotemporal ViG

Yankun Xu; Junzhe Wang; Yun-Hsuan Chen; Jie Yang; Wenjie Ming; Shuang Wang; Mohamad Sawan

TL;DR

This work tackles the challenge of real-time, video-based epileptic seizure onset detection without relying on EEG by introducing VSViG, a skeleton-based spatiotemporal Vision Graph network that uses joint-centered patch embeddings. The method fine-tunes a pose estimator for epileptic patients, constructs a partitioned skeleton graph, and applies spatial and temporal graph convolutions, followed by a probabilistic, accumulative decision rule to detect onset with low latency. It achieves state-of-the-art accuracy (RMSE ≈ 5.9% for the full model) and efficiency (FLOPs ~1.76G for VSViG; 0.44G for VSViG-Light), enables early detection (latency ≈ 5.1 s after EEG onset and ≈ 13.1 s before clinical onset) with zero false detections in tested cases, and offers interpretable visualizations of seizure-relevant partitions. The approach holds practical potential for continuous remote monitoring and could be extended to other movement-related applications such as Parkinson’s disease monitoring or fall detection.

Abstract

Accurate and efficient epileptic seizure onset detection can significantly benefit patients. Traditional diagnostic methods, which rely primarily on electroencephalograms (EEGs), often result in cumbersome and non-portable solutions, making continuous patient monitoring challenging. A video-based seizure detection system is expected to free patients from the constraints of scalp or implanted EEG devices and enable remote monitoring in residential settings. Previous video-based methods neither enable all-day monitoring nor provide short detection latency, owing to insufficient resources and ineffective patient action recognition techniques. Additionally, skeleton-based action recognition approaches remain limited in identifying subtle seizure-related actions. To address these challenges, we propose a novel Video-based Seizure detection model via a skeleton-based spatiotemporal Vision Graph neural network (VSViG) for efficient, accurate and timely detection in real-time scenarios. Our experimental results indicate that VSViG outperforms previous state-of-the-art action recognition models on our collected patients' video data with higher accuracy (5.9% error), lower FLOPs (0.4G), and smaller model size (1.4M parameters). Furthermore, by integrating a decision-making rule that combines output probabilities and an accumulative function, we achieve a 5.1 s detection latency after EEG onset, a 13.1 s detection advance before clinical onset, and a zero false detection rate. The project homepage is available at: https://github.com/xuyankun/VSViG/
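The abstract describes the decision rule only as combining output probabilities with an accumulative function. A minimal sketch of one way such a rule could work is shown below. The function name `detect_onset`, the parameters `prob_floor` and `alarm_level`, their default values, and the reset-on-low-probability behavior are all illustrative assumptions, not the authors' published rule.

```python
from typing import Iterable, Optional


def detect_onset(
    probs: Iterable[float],
    prob_floor: float = 0.5,   # assumed: clips below this reset the evidence
    alarm_level: float = 2.0,  # assumed: accumulated evidence needed to alarm
) -> Optional[int]:
    """Return the index of the first clip at which an alarm would be raised.

    Hypothetical accumulative rule: per-clip seizure probabilities are summed
    while they stay at or above `prob_floor`; a clip below the floor resets
    the sum. An alarm is raised once the sum reaches `alarm_level`.
    Returns None if no alarm is raised.
    """
    accumulated = 0.0
    for index, p in enumerate(probs):
        if p >= prob_floor:
            accumulated += p
        else:
            accumulated = 0.0
        if accumulated >= alarm_level:
            return index
    return None


if __name__ == "__main__":
    # A sustained rise in probability triggers an alarm; an isolated spike does not.
    print(detect_onset([0.1, 0.6, 0.7, 0.8, 0.9]))
    print(detect_onset([0.6, 0.2, 0.6, 0.2]))
```

Accumulating over consecutive clips, rather than thresholding a single output, is one common way to trade a few seconds of latency for fewer false alarms, which matches the latency and false detection figures the abstract reports in spirit only.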

Paper Structure (18 sections, 6 equations, 8 figures, 6 tables)

Introduction
Related Work
VSViG Framework
Pose estimation model fine-tuning
Patch extraction and patch embedding
Graph construction with partition strategy
Partitioning spatiotemporal graph modeling
VSViG network architecture
Seizure onset decision-making
Experiments
Dataset
Experimental setting
Seizure-related action recognition performance
Visualization
Ablation study
...and 3 more sections

Figures (8)

Figure 1: Motivation of the proposed skeleton-based patch embedding. The left shows real seizure-related actions; the middle shows the challenges of traditional skeleton-based approaches; the right shows our strategy to address these challenges.
Figure 2: Model comparison of achieved errors and the number of parameters.
Figure 3: Proposed skeleton-based VSViG framework. Starting from raw RGB frames, we extract skeleton-based patches around each joint by fusing RGB frames and pose heatmaps. Then features from patches are generated by a patch embedding. In spatiotemporal ViG modeling, a partition strategy with proposed inter-, intra-, and dynamic partition operations is used, as shown in the bottom three subfigures.
Figure 4: VSViG architecture. VSViG consists of four residual spatiotemporal (ST) ViG stages, and each stage contains several proposed spatial and temporal modeling layers. $N,T,C$ denote the numbers of joints, frames and channels, and $H,W$ stand for the height and width of the extracted patches.
Figure 5: Data labeling for a regression-based task. For each seizure, a video recording is divided into 3 periods: interictal (label: 0), ictal (label: 1), and transition (label rising from 0 to 1 along an exponential function, according to the clinical phenomena).
...and 3 more figures
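
The regression labels described in the Figure 5 caption can be sketched as a piecewise function of time. The caption states only that the transition label goes from 0 to 1 exponentially, so the normalized form below, the steepness parameter `k`, and the names `transition_start` and `transition_end` are illustrative assumptions rather than the authors' exact formula.

```python
import math


def seizure_label(
    t: float,
    transition_start: float,  # assumed: time (s) at which the transition period begins
    transition_end: float,    # assumed: time (s) at which the ictal period begins
    k: float = 5.0,           # assumed steepness of the exponential ramp
) -> float:
    """Return a regression target in [0, 1] for a clip at time t.

    Interictal clips get 0, ictal clips get 1, and clips in the transition
    period get a value that rises exponentially from 0 to 1.
    """
    if t <= transition_start:
        return 0.0
    if t >= transition_end:
        return 1.0
    # Fraction of the transition period elapsed, in (0, 1).
    s = (t - transition_start) / (transition_end - transition_start)
    # Normalized exponential: equals 0 at s = 0 and 1 at s = 1.
    return (math.exp(k * s) - 1.0) / (math.exp(k) - 1.0)


if __name__ == "__main__":
    for t in (0.0, 10.0, 15.0, 19.0, 20.0, 30.0):
        print(t, round(seizure_label(t, transition_start=10.0, transition_end=20.0), 3))
```

With this shape, labels stay close to 0 early in the transition and climb steeply near its end, which is one way to encode growing clinical evidence of a seizure.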
