Scope and Arbitration in Machine Learning Clinical EEG Classification

Yixuan Zhu, Luke J. W. Canham, David Western

TL;DR

The paper addresses the problem that per-window labels derived from session-level EEG annotations can mislead classifiers. It tests two strategies on the Temple University Hospital Abnormal EEG Corpus (TUAB): extending the window length, and adding a separate arbitration stage that aggregates window predictions. Together these reach an average accuracy of 93.3%, surpassing prior estimates of the upper limit on this dataset. The study shows that longer windows improve sensitivity and that a simple, dedicated arbitration model can outperform baseline aggregation methods, with results depending on the first-stage architecture. These findings advance the clinical viability of EEG classifiers and offer a framework transferable to other time-series tasks. Thresholds on $p_{abnormal}$ and the arbitration dynamics are framed to handle uncertain cases, underscoring practical considerations for deployment.

Abstract

A key task in clinical EEG interpretation is to classify a recording or session as normal or abnormal. In machine learning approaches to this task, recordings are typically divided into shorter windows for practical reasons, and these windows inherit the label of their parent recording. We hypothesised that window labels derived in this manner can be misleading: for example, windows without evident abnormalities can be labelled 'abnormal', disrupting the learning process and degrading performance. We explored two separable approaches to mitigate this problem: increasing the window length and introducing a second-stage model to arbitrate between the window-specific predictions within a recording. Evaluating these methods on the Temple University Hospital Abnormal EEG Corpus, we significantly improved state-of-the-art average accuracy from 89.8 percent to 93.3 percent. This result defies previous estimates of the upper limit for performance on this dataset and represents a major step towards clinical translation of machine learning approaches to this problem.
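
The windowing and label-inheritance setup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the 100 Hz sampling rate, and the choice of mean-probability thresholding as the baseline aggregation are all assumptions made for the example.

```python
from typing import List, Tuple


def split_into_windows(n_samples: int, window_len: int) -> List[Tuple[int, int]]:
    """Return (start, stop) sample indices of non-overlapping, full-length windows."""
    return [(s, s + window_len)
            for s in range(0, n_samples - window_len + 1, window_len)]


def inherit_labels(windows: List[Tuple[int, int]], recording_label: int) -> List[int]:
    """Every window receives the label of its parent recording.

    This is the step that can mislead training: a window with no visible
    abnormality is still labelled 1 if the recording as a whole is abnormal.
    """
    return [recording_label] * len(windows)


def mean_arbitration(window_probs: List[float], threshold: float = 0.5) -> int:
    """Hypothetical baseline: average the window probabilities, then threshold."""
    if not window_probs:
        raise ValueError("need at least one window prediction")
    return int(sum(window_probs) / len(window_probs) >= threshold)


# Assumed example: 20 minutes of signal at an assumed 100 Hz sampling rate.
SAMPLING_RATE_HZ = 100
n_samples = 20 * 60 * SAMPLING_RATE_HZ

short_windows = split_into_windows(n_samples, 60 * SAMPLING_RATE_HZ)   # 60 s windows
long_windows = split_into_windows(n_samples, 600 * SAMPLING_RATE_HZ)   # 600 s windows
window_labels = inherit_labels(short_windows, recording_label=1)
```

Under these assumptions, longer windows yield fewer training examples per recording, but each window covers more of the signal and is therefore more likely to contain whatever justified the recording-level label.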

Paper Structure (21 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Generic diagram of a typical deep learning approach to clinical EEG classification, as used e.g. by Schirrmeister et al. (2017).
  • Figure 2: 'Raw' and 'Histogram' pre-processing for the arbitration model. Each small square in 'Raw' is the output of the first-stage model (probability of 'abnormal') for one window. In this example there are 16 windows in the recording. In general, since we use at most the data between minutes 1 and 21 of a recording, a recording contains at most 20 windows of length 1 minute. When there are fewer than 20, we pad with zeros at the end. We then bin the 'Raw' values into a histogram of ten equal bins across the range 0-1.
  • Figure 3: 'Hybrid' pre-processing for the arbitration model.
  • Figure 4: Performance of different arbitration models using window lengths of (a) 60 s and (b) 600 s. Points with the same marker shape come from the same instance of the first-stage model. The dashed lines represent the mean for each arbitration method.
  • Figure 5: Effect of window length on (a) accuracy, (b) sensitivity, and (c) specificity. Note that the accuracy of the 'no_arbitration' approach is calculated across all windows ($4340 \leq N \leq 57482$, depending on window length), whereas the accuracy of the arbitration models is calculated across all recordings ($N=2993$).
  • ...and 1 more figures
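
The 'Raw' and 'Histogram' pre-processing described in the Figure 2 caption can be illustrated with a short sketch. It is not the authors' implementation. It assumes that first-stage outputs are probabilities in [0, 1], that the ten bins span that interval, that the histogram is computed over the zero-padded vector (so padding falls into the lowest bin), and that a value of exactly 1.0 goes into the top bin. The names `raw_features` and `histogram_features` are hypothetical.

```python
from typing import List

MAX_WINDOWS = 20  # at most 20 one-minute windows per recording (from the caption)
N_BINS = 10       # ten equal-width bins (from the caption)


def raw_features(window_probs: List[float]) -> List[float]:
    """Zero-pad the per-window probabilities at the end to a fixed length."""
    if len(window_probs) > MAX_WINDOWS:
        raise ValueError("more windows than the fixed input length allows")
    return list(window_probs) + [0.0] * (MAX_WINDOWS - len(window_probs))


def histogram_features(raw: List[float]) -> List[int]:
    """Count the values of the padded 'Raw' vector in ten equal bins on [0, 1]."""
    counts = [0] * N_BINS
    for p in raw:
        if not 0.0 <= p <= 1.0:
            raise ValueError("expected a probability in [0, 1]")
        # Assumption: p == 1.0 is clamped into the last bin.
        idx = min(int(p * N_BINS), N_BINS - 1)
        counts[idx] += 1
    return counts


# Assumed example: a short recording with only four windows.
probs = [0.05, 0.95, 0.55, 0.51]
raw = raw_features(probs)
hist = histogram_features(raw)
```

The 'Raw' vector keeps the temporal order of the windows, whereas the histogram discards order and keeps only the distribution of window-level confidence. Which representation suits a given arbitration model is an empirical question that the paper's experiments address.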

Theorems & Definitions (2)

  • Conjecture 1: Increased Window Length
  • Conjecture 2: Arbitration