Window Stacking Meta-Models for Clinical EEG Classification

Yixuan Zhu; Rohan Kandasamy; Luke J. W. Canham; David Western

TL;DR

This work tackles the challenge of aggregating windowed EEG data by introducing window-stacking meta-models that arbitrate per-window predictions across multiple stages. A two-stage framework (first-stage deep models followed by second-stage ANN or XGBoost meta-models) is extended with a third stage of session-level arbitration on AutoTUAB, and performance is further improved by varying the window length, overlapping windows, and using intermediate first-stage features. On TUAB, the best configurations reach about $99.0\%$ accuracy with near-perfect specificity and high sensitivity, while results on AutoTUAB approach human inter-rater ceilings, demonstrating strong generalization and clinical relevance. The study also provides explainability insights via window-importance analyses and SHAP visuals, highlighting the predominance of early windows in decision-making and the potential biases from padding. Overall, the window-stacking approach offers a scalable, interpretable path toward high-accuracy EEG abnormality classification suitable for clinical deployment, with clear avenues for expanding to larger and more diverse datasets.

Abstract

Windowing is a common technique in EEG machine learning classification and other time series tasks. However, a challenge arises when employing this technique: computational expense inhibits learning global relationships across an entire recording or set of recordings. Furthermore, the labels inherited by windows from their parent recordings may not accurately reflect the content of that window in isolation. To resolve these issues, we introduce a multi-stage model architecture, incorporating meta-learning principles tailored to time-windowed data aggregation. We further tested two distinct strategies to alleviate these issues: lengthening the window and utilizing overlapping to augment data. Our methods, when tested on the Temple University Hospital Abnormal EEG Corpus (TUAB), dramatically boosted the benchmark accuracy from 89.8% to 99.0%. This breakthrough performance surpasses prior performance projections for this dataset and paves the way for clinical applications of machine learning solutions to EEG interpretation challenges. On a broader and more varied dataset from the Temple University Hospital EEG Corpus (TUEG), we attained an accuracy of 86.7%, nearing the assumed performance ceiling set by variable inter-rater agreement on such datasets.
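
As a rough illustration of the windowing and arbitration ideas described in the abstract, the following is a minimal sketch and not the authors' implementation. The function names (`split_into_windows`, `mean_arbitration`, `stack_windows`), the padding value of 0.5, and the 0.5 decision threshold are all assumptions made for this example. It shows how a recording might be cut into fixed-length windows with optional overlap, how per-window abnormality probabilities could be averaged as a simple baseline, and how they could instead be stacked into a fixed-length vector for a second-stage meta-model such as an ANN or XGBoost.

```python
from statistics import mean


def split_into_windows(n_samples, window_len, overlap=0.0):
    """Return (start, stop) sample indices for fixed-length windows.

    `overlap` is the fraction of each window shared with the next one.
    Trailing samples that do not fill a whole window are dropped.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(round(window_len * (1.0 - overlap))))
    return [(s, s + window_len)
            for s in range(0, n_samples - window_len + 1, step)]


def mean_arbitration(window_probs, threshold=0.5):
    """Baseline: average per-window P(abnormal), then threshold (assumed 0.5)."""
    return int(mean(window_probs) >= threshold)


def stack_windows(window_probs, max_windows, pad_value=0.5):
    """Build a fixed-length meta-model input from per-window probabilities.

    Recordings with fewer windows are padded (pad value is an assumption);
    longer recordings are truncated to `max_windows`.
    """
    clipped = list(window_probs[:max_windows])
    return clipped + [pad_value] * (max_windows - len(clipped))


if __name__ == "__main__":
    # Ten samples, windows of four samples, 50% overlap.
    print(split_into_windows(10, 4, overlap=0.5))
    # Hypothetical per-window probabilities from a first-stage model.
    probs = [0.9, 0.8, 0.2]
    print(mean_arbitration(probs))
    print(stack_windows(probs, max_windows=5))
```

In this sketch the stacked vector, rather than the simple mean, would be passed to the meta-model, which lets the second stage weight windows by position. The choice of padding value matters in practice, since padded positions can themselves carry information about recording length.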

Paper Structure (35 sections, 2 equations, 10 figures, 1 table)

Introduction
Background
Proposal
Method
Data
Overview
TUAB
AutoTUAB
First-Stage Model
Second-Stage Models for Arbitration (Meta-Models)
Baseline
Pre-Processing of Meta-Model Inputs
Meta-Model Architectures
Overview
Artificial Neural Network (ANN)
...and 20 more sections

Figures (10)

Figure 1: Generic diagram of a typical deep learning approach to clinical EEG classification.
Figure 2: A typical deep learning first-stage model architecture. First, the input passes through the model's feature extraction layer to be transformed into features. Next, these features pass through the classification layer to become logits. Finally, the logits are processed by the softmax layer to yield probability estimates.
Figure 3: Performance comparison of single-stage and various two-stage architectures, all with a window length of 60 s, using the TUAB dataset. Each column represents a different arbitration method. Each marker type represents a different first-stage architecture. Each data point is the average accuracy across twenty-five experiments. Note that the accuracy of the 'no arbitration' approach is calculated across all windows ($N = 57482$), whereas the accuracy of the arbitration models is calculated across all recordings ($N = 2993$).
Figure 4: Performance of multi-stage methods on AutoTUAB using Deep4 as the first-stage architecture with a window length of 60 s and no overlap. Each marker represents a single experiment and the dashed lines represent the mean accuracy of these 5 experiments. The third-stage model applied 'mean' arbitration to the per-recording outputs of an ANN-based second-stage model, which took 'raw' per-window probabilities as inputs.
Figure 5: Effect of window length on accuracy. In all cases, Deep4 is used as the first-stage model.
...and 5 more figures
