
Window Stacking Meta-Models for Clinical EEG Classification

Yixuan Zhu, Rohan Kandasamy, Luke J. W. Canham, David Western

TL;DR

This work tackles the challenge of aggregating windowed EEG data by introducing window-stacking meta-models that arbitrate per-window predictions across multiple stages. A two-stage framework (first-stage deep models and second-stage ANN or XGBoost meta-models) is extended with a third-stage session-level arbitration on AutoTUAB, and its performance is further enhanced by exploring window length, overlapping, and using intermediate first-stage features. On TUAB, the best configurations reach about $99.0\%$ accuracy with near-perfect specificity and high sensitivity, while AutoTUAB approaches human inter-rater ceilings, demonstrating strong generalization and clinical relevance. The study also provides explainability insights via window-importance analyses and SHAP visuals, highlighting the predominance of early windows in decision-making and the potential biases from padding. Overall, the window-stacking approach offers a scalable, interpretable path toward high-accuracy EEG abnormality classification suitable for clinical deployment, with clear avenues for expanding to larger and more diverse datasets.

Abstract

Windowing is a common technique in EEG machine learning classification and other time series tasks. However, a challenge arises when employing this technique: computational expense inhibits learning global relationships across an entire recording or set of recordings. Furthermore, the labels inherited by windows from their parent recordings may not accurately reflect the content of that window in isolation. To resolve these issues, we introduce a multi-stage model architecture, incorporating meta-learning principles tailored to time-windowed data aggregation. We further tested two distinct strategies to alleviate these issues: lengthening the window and utilizing overlapping to augment data. Our methods, when tested on the Temple University Hospital Abnormal EEG Corpus (TUAB), dramatically boosted the benchmark accuracy from 89.8% to 99.0%. This breakthrough performance surpasses prior performance projections for this dataset and paves the way for clinical applications of machine learning solutions to EEG interpretation challenges. On a broader and more varied dataset from the Temple University Hospital EEG Corpus (TUEG), we attained an accuracy of 86.7%, nearing the assumed performance ceiling set by variable inter-rater agreement on such datasets.
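As a rough illustration of the windowing described in the abstract, the sketch below cuts a recording into fixed-length windows, optionally overlapping, and lets every window inherit its parent recording's label. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the 100 Hz sampling rate, the 5-minute recording length, and the choice to drop trailing samples rather than pad are all hypothetical.

```python
def make_windows(n_samples, window_len, overlap=0.0):
    """Return (start, stop) sample indices of fixed-length windows.

    overlap is the fraction of each window shared with the next one
    (0.0 = no overlap, 0.5 = half of each window is shared).
    Trailing samples that do not fill a whole window are dropped here;
    padding the last window is an alternative (assumption, not from the paper).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(round(window_len * (1.0 - overlap))))
    return [(start, start + window_len)
            for start in range(0, n_samples - window_len + 1, step)]


def label_windows(windows, recording_label):
    """Every window inherits the label of its parent recording."""
    return [(start, stop, recording_label) for start, stop in windows]


# Hypothetical example: a 5-minute recording at 100 Hz, cut into 60 s windows.
fs = 100
n_samples = 5 * 60 * fs
windows_plain = make_windows(n_samples, 60 * fs, overlap=0.0)
windows_half = make_windows(n_samples, 60 * fs, overlap=0.5)
labelled = label_windows(windows_plain, recording_label=1)
```

With these hypothetical settings, overlapping windows by half raises the number of training windows from the same recording, which is the data-augmentation effect the abstract refers to. The inherited label is only as accurate as the assumption that every window reflects the recording-level diagnosis.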

Paper Structure

This paper contains 35 sections, 2 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Generic diagram of a typical deep learning approach to clinical EEG classification.
  • Figure 2: A typical deep learning first-stage model architecture. First, the input passes through the model's feature extraction layer to be transformed into features. Then, these features pass through the classification layer to become logits. Finally, the logits are processed by the softmax layer to yield probability estimates.
  • Figure 3: Performance comparison of single-stage and various two-stage architectures, all with a window length of 60 s, using the TUAB dataset. Each column represents a different arbitration method. Each marker type represents a different first-stage architecture. Each data point is the average accuracy across twenty-five experiments. Note that the accuracy of the 'no arbitration' approach is calculated across all windows ($N = 57482$), whereas the accuracy of the arbitration models is calculated across all recordings ($N = 2993$).
  • Figure 4: Performance of multi-stage methods on AutoTUAB using Deep4 as the first-stage architecture with a window length of 60 s and no overlap. Each marker represents a single experiment and the dashed lines represent the mean accuracy of the 5 experiments. The third-stage model applied 'mean' arbitration to the per-recording outputs of an ANN-based second-stage model, which took 'raw' per-window probabilities as inputs.
  • Figure 5: Effect of window length on accuracy. In all cases, Deep4 is used as the first-stage model.
  • ...and 5 more figures
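
The captions for Figures 2 and 4 describe logits being turned into probabilities by a softmax layer and per-window outputs being combined by 'mean' arbitration. The sketch below shows one plausible reading of those two steps in plain Python. It is illustrative only: the logit values, the two-class ordering [normal, abnormal], and the 0.5 decision threshold are assumptions, and the learned ANN and XGBoost meta-models from the paper are not reproduced here.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_arbitration(window_probs, threshold=0.5):
    """Average per-window P(abnormal) and threshold the mean.

    The 0.5 threshold is an assumption made for this sketch.
    """
    mean_p = sum(window_probs) / len(window_probs)
    return mean_p, int(mean_p >= threshold)


# Hypothetical first-stage logits [normal, abnormal] for three windows
# of a single recording.
window_logits = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
p_abnormal = [softmax(logits)[1] for logits in window_logits]
window_preds = [int(p >= 0.5) for p in p_abnormal]
mean_p, recording_pred = mean_arbitration(p_abnormal)
```

In this toy case the first window is classified as normal on its own, yet the recording-level decision is abnormal once the three windows are averaged. This also illustrates why the Figure 3 caption distinguishes accuracy computed over windows from accuracy computed over recordings: the two are measured on different units.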