Scaling convolutional neural networks achieves expert-level seizure detection in neonatal EEG

Robert Hogan, Sean R. Mathieson, Aurel Luca, Soraia Ventura, Sean Griffin, Geraldine B. Boylan, John M. O'Toole

TL;DR

This paper demonstrates that scaling both training data and model size substantially improves neonatal EEG seizure detection, reaching expert-level performance on two held-out datasets. Using a ConvNeXt-based 1D CNN trained on a large dataset annotated per channel, the XL model achieves state-of-the-art results on an open-access dataset (MCC=0.764, AUC=0.982) and matches expert inter-rater agreement on unseen data, indicating readiness for further clinical validation. The work also evaluates the model across a diverse set of metrics and highlights practical considerations such as EEG montage, detectability of short seizures, and distribution shifts across centers. Overall, the study provides strong evidence that scaling data and model size can yield clinically viable automated seizure detection for neonatal EEG, potentially enabling broader monitoring and timely neuroprotective interventions.

Abstract

Background: Neonatal seizures are a neurological emergency that requires urgent treatment. They are hard to diagnose clinically and can go undetected if EEG monitoring is unavailable. EEG interpretation requires specialised expertise, which is not widely available. Algorithms to detect EEG seizures can address this limitation but have yet to reach widespread clinical adoption. Methods: Retrospective EEG data from 332 neonates was used to develop and validate a seizure-detection model. The model was trained and tested with a development dataset ($n=202$) that was annotated with over 12k seizure events on a per-channel basis. This dataset was used to develop a convolutional neural network (CNN) using a modern architecture and training methods. The final model was then validated on two independent multi-reviewer datasets ($n=51$ and $n=79$). Results: Increasing dataset and model size improved model performance: Matthews correlation coefficient (MCC) and Pearson's correlation ($r$) increased by up to 50% with data scaling and up to 15% with model scaling. Over 50k hours of annotated single-channel EEG was used for training a model with 21 million parameters. State-of-the-art performance was achieved on an open-access dataset (MCC=0.764, $r=0.824$, and AUC=0.982). The CNN attains expert-level performance on both held-out validation sets, with no significant difference between inter-rater agreement among the experts and agreement between the experts and the algorithm ($\Delta\kappa < -0.095$, $p>0.05$). Conclusion: With orders-of-magnitude increases in data and model scale, we have produced a new state-of-the-art model for neonatal seizure detection. Expert-level equivalence on completely unseen data, a first in this field, provides a strong indication that the model is ready for further clinical validation.
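The abstract reports MCC and Pearson's $r$ as headline metrics. As a rough illustration only (not the authors' evaluation code), the sketch below computes both for per-segment labels. The function names and the toy label vectors are hypothetical. For two binary vectors, Pearson's $r$ coincides with MCC (the phi coefficient), which the example uses as a sanity check.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = seizure, 0 = non-seizure)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 if any marginal is empty."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy example (hypothetical labels): one entry per EEG segment.
annotation = [0, 0, 1, 1, 1, 0, 0, 0]
prediction = [0, 1, 1, 1, 0, 0, 0, 0]

print(mcc(annotation, prediction))        # 7/15 for this toy example
print(pearson_r(annotation, prediction))  # same value, since both inputs are binary
```

In practice Pearson's $r$ is more informative when applied to a continuous model output rather than thresholded predictions, because it is then sensitive to the confidence of the output as well as its sign.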

Paper Structure (28 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: a) A 60-second sample of EEG with per-channel seizure annotations shaded. Only 3 of the 8 channels contain seizure. b) Annotation and model outputs for 10 hours from C4-O2 of the same EEG recording, from the development test set. The EEG sample in a) corresponds to the first 60 s of the first seizure event in b). As model size increases, the models become more confident, suppressing the output for non-seizure periods while maintaining high agreement in seizure periods. This ease of interpretation would benefit clinical implementations that use the real-time model output trace.
  • Figure 2: Scaling training data and model size yields approximate power-law performance gains. Metrics are calculated at the segment level for 20% ($41/202$ neonates) of the development dataset in (a) and (b), and at the neonate level for the held-out validation datasets in (c). (a): Performance improves with increasing number of neonates and hours of annotated EEG (counted per channel) in the training set; error bars denote min/max over 3 trials. Prominent datasets from the literature are included for comparison [Stevenson2019; OShea2020; Temko2011]. (b): Scaling model size over 3 orders of magnitude reveals a typical deep double descent pattern, with a performance dip for the Small model before recovering for larger models. Marker size indicates computational cost in giga floating-point operations (GFLOP). (c): Scaling model size on the held-out datasets from Cork and Helsinki. We include a linear fit to illustrate the predictability of the performance increase. See the section on performance metrics for a description of the metrics used.
  • Figure 3: Summary of per-channel EEG seizure annotations for 77 neonates. (a): number of channels involved in each seizure event. (b): agreement among seizure annotations across channels for each seizure event, as quantified by Fleiss' $\kappa$. (c): total seizure duration for each neonate's EEG, estimated from each channel separately. For a small number of EEGs, F3 is replaced by Fp1 or Fp3; likewise, F4 is replaced by Fp2 or Fp4.
  • Figure 4: ConvNeXt block, adapted from the ConvNeXt architecture [convext]. Here $W$ is an integer parameter we use to control the width of our models. The block includes three convolutional layers: one with downsampling (indicated by $d$) and a one-dimensional kernel of length 7 samples, followed by two $1\times 1$ convolutional layers, which are an equivalent implementation of a multi-layer perceptron. Notable features are the use of layer normalisation (LN) rather than batch normalisation, and a Gaussian error linear unit (GELU) instead of the rectified linear unit (ReLU).
  • Figure 5: Extra-Large (XL) model performance by event duration of consensus seizures for the Cork (left) and Helsinki (right) validation datasets. Most of the model errors are for short seizures ($<30$ s), where the 16 s input segment size may limit detection resolution.
  • ...and 2 more figures
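
The caption of Figure 3 quantifies agreement across channels with Fleiss' $\kappa$. The following is a minimal sketch of the standard Fleiss' $\kappa$ formula, not the authors' code. It assumes (hypothetically) that each row is one time segment, that the "raters" are the channels, and that the two columns count votes for non-seizure and seizure; the function name and toy data are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of shape (subjects x categories).

    counts[i][j] is the number of raters assigning subject i to category j.
    Every row must sum to the same number of raters n (n >= 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    n_categories = len(counts[0])
    # Proportion of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per subject, then averaged over subjects.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_subj) / n_subjects
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy example (hypothetical): 4 segments, 3 channels acting as raters,
# columns are [non-seizure votes, seizure votes].
table = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(fleiss_kappa(table))  # 0.625 for this toy table
```

A value of 1 indicates that all raters agree on every segment, while values near 0 indicate agreement no better than chance.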