Scope and Arbitration in Machine Learning Clinical EEG Classification
Yixuan Zhu, Luke J. W. Canham, David Western
TL;DR
The paper addresses the problem that per-window labels inherited from session-level EEG annotations can mislead classifiers. It tests two strategies on the Temple University Hospital Abnormal EEG Corpus (TUAB): extending the window length, and adding a separate arbitration stage that aggregates the window-level predictions within a recording. Together these raise average accuracy to 93.3%, exceeding previous estimates of the upper limit of performance on this dataset. The study reports that longer windows improve sensitivity and that a simple, dedicated arbitration model can outperform baseline aggregation methods, with the size of the benefit depending on the first-stage architecture. These findings advance the clinical viability of EEG classifiers and offer a framework that may transfer to other time-series tasks. The paper also frames thresholds on $p_{abnormal}$ and the behaviour of the arbitration stage as ways of handling uncertain cases, a practical consideration for deployment.
Abstract
A key task in clinical EEG interpretation is to classify a recording or session as normal or abnormal. In machine learning approaches to this task, recordings are typically divided into shorter windows for practical reasons, and these windows inherit the label of their parent recording. We hypothesised that window labels derived in this manner can be misleading; for example, windows without evident abnormalities can be labelled 'abnormal', disrupting the learning process and degrading performance. We explored two separable approaches to mitigate this problem: increasing the window length and introducing a second-stage model to arbitrate between the window-specific predictions within a recording. Evaluating these methods on the Temple University Hospital Abnormal EEG Corpus, we significantly improved state-of-the-art average accuracy from 89.8% to 93.3%. This result defies previous estimates of the upper limit for performance on this dataset and represents a major step towards clinical translation of machine learning approaches to this problem.
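The windowing and aggregation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the non-overlapping windowing, the 0.5 threshold, and the toy probabilities are all assumptions made for illustration, and the mean-based rule stands in for the kind of baseline aggregation that a learned arbitration stage would replace.

```python
from statistics import mean


def split_into_windows(signal, window_len):
    """Split a sequence of samples into non-overlapping windows.

    Any trailing samples that do not fill a whole window are dropped.
    """
    n_windows = len(signal) // window_len
    return [signal[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


def inherit_labels(windows, recording_label):
    """Give every window the label of its parent recording.

    This is the step that can mislead training: a window with no evident
    abnormality still receives 'abnormal' if the recording is abnormal.
    """
    return [recording_label] * len(windows)


def mean_arbitration(window_probs, threshold=0.5):
    """Baseline aggregation (hypothetical): average the window-level
    p_abnormal values and compare the result with a threshold."""
    return "abnormal" if mean(window_probs) > threshold else "normal"


# Toy recording of 10 samples, cut into windows of 4 samples each.
recording = list(range(10))
windows = split_into_windows(recording, window_len=4)
labels = inherit_labels(windows, "abnormal")

# Made-up window-level probabilities from a first-stage classifier.
decision = mean_arbitration([0.1, 0.9, 0.8])
```

In this toy case the first window has a low abnormality probability yet would still carry the 'abnormal' training label, which is the mismatch the paper targets. Longer windows reduce how often such mislabelled windows occur, and a trained second-stage model could take the place of `mean_arbitration`.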
