Learning Cross-Joint Attention for Generalizable Video-Based Seizure Detection

Omar Zamzam; Takfarinas Medani; Chinmay Chinara; Richard Leahy

Abstract

Automated seizure detection from long-term clinical videos can substantially reduce manual review time and enable real-time monitoring. However, existing video-based methods often struggle to generalize to unseen subjects due to background bias and reliance on subject-specific appearance cues. We propose a joint-centric attention model that focuses exclusively on body dynamics to improve cross-subject generalization. For each video segment, body joints are detected and joint-centered clips are extracted, suppressing background context. These joint-centered clips are tokenized using a Video Vision Transformer (ViViT), and cross-joint attention is learned to model spatial and temporal interactions between body parts, capturing coordinated movement patterns characteristic of seizure semiology. Extensive cross-subject experiments show that the proposed method consistently outperforms state-of-the-art CNN-, graph-, and transformer-based approaches on unseen subjects.

Paper Structure (15 sections, 4 equations, 2 figures, 1 table)

Introduction
Related Work
Method
Joint-Centric Video Representation
Cross-Joint Attention and Classification
Parameter-Efficient Fine-Tuning
Experimental Setup
Dataset.
Segment construction and labeling.
Train-validation-test split.
Baselines.
Evaluation metrics.
Results
Discussion
Conclusion

Figures (2)

Figure 1: Overview of the proposed joint-centric video-based seizure detection framework. A clinical video segment is first processed using pose estimation to localize major body joints, and joint-centric sub-videos are extracted for each body part. Each joint video is encoded independently using a shared pretrained Video Vision Transformer (ViViT) to obtain joint-level motion tokens. Positional information, including joint location and joint identity, is projected and added to the motion tokens. A multi-head self-attention module models adaptive inter-joint relationships, and the resulting representations are pooled and passed through a linear classification layer to predict seizure presence.
Figure 2: Temporal distribution of model predictions relative to clinical onset for one subject. Predictions are produced at 1 s intervals.
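
The fusion stage described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-joint motion tokens have already been produced by the video encoder (random vectors stand in for them here), it uses a single attention head instead of the multi-head module, and the mean pooling, sigmoid output, joint count, token dimension, and all parameter names are hypothetical choices made only for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def cross_joint_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention across joint tokens."""
    q = [linear(t, wq) for t in tokens]
    k = [linear(t, wk) for t in tokens]
    v = [linear(t, wv) for t in tokens]
    d = len(q[0])
    outputs, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)  # how strongly this joint attends to every joint
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(d)])
    return outputs, weights


def seizure_probability(motion_tokens, joint_xy, params):
    """Add positional information, attend across joints, pool, and classify."""
    tokens = []
    for j, (tok, xy) in enumerate(zip(motion_tokens, joint_xy)):
        loc = linear(list(xy), params["w_loc"])   # projected joint location
        ident = params["joint_embed"][j]          # joint identity embedding
        tokens.append([t + l + e for t, l, e in zip(tok, loc, ident)])
    attended, weights = cross_joint_attention(
        tokens, params["wq"], params["wk"], params["wv"]
    )
    n = len(attended)
    # Mean pooling over joints (an assumption; the caption only says "pooled").
    pooled = [sum(row[c] for row in attended) / n for c in range(len(attended[0]))]
    logit = sum(p * w for p, w in zip(pooled, params["w_cls"])) + params["b_cls"]
    return 1.0 / (1.0 + math.exp(-logit)), weights


rng = random.Random(0)
NUM_JOINTS, DIM = 5, 8  # hypothetical sizes, not taken from the paper
params = {
    "w_loc": random_matrix(2, DIM, rng),
    "joint_embed": random_matrix(NUM_JOINTS, DIM, rng),
    "wq": random_matrix(DIM, DIM, rng),
    "wk": random_matrix(DIM, DIM, rng),
    "wv": random_matrix(DIM, DIM, rng),
    "w_cls": [rng.uniform(-0.5, 0.5) for _ in range(DIM)],
    "b_cls": 0.0,
}
# Placeholder motion tokens and normalized (x, y) joint locations.
motion_tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_JOINTS)]
joint_xy = [(rng.random(), rng.random()) for _ in range(NUM_JOINTS)]

prob, attn = seizure_probability(motion_tokens, joint_xy, params)
print(f"seizure probability: {prob:.3f}")
```

Each row of `attn` is a distribution over joints, so it can be read as how much one body part draws on the motion of the others when the segment is classified.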
