Window Stacking Meta-Models for Clinical EEG Classification
Yixuan Zhu, Rohan Kandasamy, Luke J. W. Canham, David Western
TL;DR
This work tackles the challenge of aggregating windowed EEG data by introducing window-stacking meta-models that arbitrate per-window predictions across multiple stages. A two-stage framework (first-stage deep models and second-stage ANN or XGBoost meta-models) is extended with a third-stage session-level arbitration on AutoTUAB, and its performance is further improved by varying window length, overlapping windows, and using intermediate first-stage features. On TUAB, the best configurations reach about 99.0% accuracy with near-perfect specificity and high sensitivity, while results on AutoTUAB approach human inter-rater ceilings, demonstrating strong generalization and clinical relevance. The study also provides explainability insights via window-importance analyses and SHAP visuals, highlighting the predominance of early windows in decision-making and the potential biases from padding. Overall, the window-stacking approach offers a scalable, interpretable path toward high-accuracy EEG abnormality classification suitable for clinical deployment, with clear avenues for expanding to larger and more diverse datasets.
Abstract
Windowing is a common technique in machine learning classification of EEG and other time series. However, it raises two problems. First, computational expense inhibits learning global relationships across an entire recording or set of recordings. Second, the labels that windows inherit from their parent recordings may not accurately reflect the content of each window in isolation. To resolve these issues, we introduce a multi-stage model architecture that incorporates meta-learning principles tailored to aggregating time-windowed data. We further tested two distinct strategies to alleviate these issues: lengthening the window and overlapping windows to augment the data. Tested on the Temple University Hospital Abnormal EEG Corpus (TUAB), our methods raised the benchmark accuracy from 89.8% to 99.0%. This result surpasses prior performance projections for this dataset and paves the way for clinical applications of machine learning solutions to EEG interpretation challenges. On a broader and more varied dataset from the Temple University Hospital EEG Corpus (TUEG), we attained an accuracy of 86.7%, nearing the assumed performance ceiling set by variable inter-rater agreement on such datasets.
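The general idea of window stacking can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy first-stage scorer, the channel count, and the padding value of 0.5 are all assumptions made for illustration. The sketch splits a recording into (optionally overlapping) windows, scores each window with a stand-in first-stage model, and arranges the per-window outputs into one fixed-length row that a second-stage meta-model (for example an ANN or XGBoost classifier) could consume.

```python
import numpy as np


def split_into_windows(recording, window_len, overlap=0.0):
    """Split a (channels, samples) array into fixed-length windows.

    `overlap` is the fraction of each window shared with the next one.
    Assumes the recording is at least one window long.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(round(window_len * (1.0 - overlap))))
    n_samples = recording.shape[1]
    starts = range(0, n_samples - window_len + 1, step)
    return np.stack([recording[:, s:s + window_len] for s in starts])


def toy_first_stage(windows):
    """Hypothetical stand-in for a trained first-stage deep model.

    Maps each window to a pseudo-probability of "abnormal" based on
    mean signal power; a real model would be learned from data.
    """
    power = (windows ** 2).mean(axis=(1, 2))
    return power / (1.0 + power)


def stack_window_predictions(probs, max_windows, pad_value=0.5):
    """Arrange per-window outputs into one fixed-length feature row.

    Recordings with fewer windows are padded with `pad_value` (an assumed
    neutral value); recordings with more windows are truncated.
    """
    row = np.full(max_windows, pad_value, dtype=float)
    n = min(len(probs), max_windows)
    row[:n] = probs[:n]
    return row


# Example: one synthetic 21-channel recording, 50% window overlap.
rng = np.random.default_rng(0)
recording = rng.standard_normal((21, 1000))
windows = split_into_windows(recording, window_len=200, overlap=0.5)
probs = toy_first_stage(windows)
features = stack_window_predictions(probs, max_windows=12)
```

Each recording yields one such row, and the rows (with the recording-level labels) form the training set for the second stage. Because the window position is preserved in the row, the meta-model can weight early and late windows differently; the padded entries are also visible to it, which is one way padding could introduce bias.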
