SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference

Jonathan Dan, Amirhossein Shahbazinia, Christodoulos Kechris, David Atienza

TL;DR

The paper reports a large-scale, SzCORE-based seizure-detection challenge using a private 19-channel EMU scalp-EEG dataset from 65 subjects. It demonstrates the need for standardized evaluation to benchmark generalizability, details a Docker-based submission workflow, and analyzes 28 evaluated algorithms, with the Sz Transformer achieving the top event-based F1-score of 43% (sensitivity 37%, precision 45%, FP/day 1.34). The work highlights a notable gap between self-reported and independent performance and introduces a continuously open benchmarking platform to enable reproducible, cross-study comparisons and progressive improvements in seizure detection. Overall, it provides both a rigorous cross-method comparison and a path toward sustainable, clinically relevant benchmarking in epilepsy AI.

Abstract

Reliable automatic seizure detection from long-term EEG remains a challenge, as current machine learning models often fail to generalize across patients or clinical settings. Manual EEG review remains the clinical standard, underscoring the need for robust models and standardized evaluation. To rigorously assess algorithm performance, we organized a challenge using a private dataset of continuous EEG recordings from 65 subjects (4,360 hours). Expert neurophysiologists annotated the data, providing ground truth for seizure events. Participants were required to detect seizure onset and duration, with evaluation based on event-based metrics, including sensitivity, precision, F1-score, and false positives per day. The SzCORE framework ensured standardized evaluation. The primary ranking criterion was the event-based F1-score, reflecting clinical relevance by balancing sensitivity and false positives. The challenge received 30 submissions from 19 teams, with 28 algorithms evaluated. Results revealed wide variability in performance, with a top F1-score of 43% (sensitivity 37%, precision 45%), highlighting the ongoing difficulty of seizure detection. The challenge also revealed a gap between reported performance and real-world evaluation, emphasizing the importance of rigorous benchmarking. Compared to previous challenges and commercial systems, the best-performing algorithm in this contest showed improved performance. Importantly, the challenge platform now supports continuous benchmarking, enabling reproducible research, integration of new datasets, and clinical evaluation of seizure detection algorithms using a standardized framework.
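The event-based metrics named in the abstract can be illustrated with a minimal sketch. It assumes that predicted and reference seizure events have already been matched, so that only the counts of true positives, false positives, and false negatives remain; the matching rules used by SzCORE (overlap criteria, tolerances, event merging) are not reproduced here. The names `EventCounts` and `event_metrics` are hypothetical and are not part of the SzCORE code base.

```python
from dataclasses import dataclass


@dataclass
class EventCounts:
    """Event-level counts for one recording or one pooled dataset (hypothetical container)."""
    true_positives: int    # reference seizures matched by a detection
    false_positives: int   # detections matching no reference seizure
    false_negatives: int   # reference seizures with no matching detection
    recording_hours: float # total duration of the scored EEG


def event_metrics(c: EventCounts) -> dict:
    """Compute sensitivity, precision, F1-score and false positives per day.

    Assumes event matching has been done upstream. Undefined ratios
    (zero denominators) are returned as NaN rather than guessed.
    """
    nan = float("nan")
    tp, fp, fn = c.true_positives, c.false_positives, c.false_negatives

    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else nan
    precision = tp / (tp + fp) if (tp + fp) > 0 else nan
    # F1 is the harmonic mean of sensitivity and precision,
    # written here directly in terms of the counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else nan
    # False positives are normalized by recording duration expressed in days.
    fp_per_day = fp / (c.recording_hours / 24) if c.recording_hours > 0 else nan

    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "fp_per_day": fp_per_day,
    }


# Illustrative counts only, not challenge data.
example = event_metrics(EventCounts(true_positives=8, false_positives=2,
                                    false_negatives=2, recording_hours=48.0))
```

This sketch pools counts into a single set of scores. Scores that are first computed per subject and then averaged generally differ from pooled scores, so a reported F1-score need not equal the harmonic mean of the reported sensitivity and precision.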

Paper Structure

This paper contains 18 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Distribution of the data in the Filadelfia dataset.
  • Figure 2: Sensitivity as a function of precision for the algorithms submitted to the challenge. The background is shaded according to the F1-score, with dashed lines indicating iso-F1-score curves.
  • Figure 3: Percentage of algorithms that detect a percentage of events. Panel A shows the true positives, and panel B shows the false positives.
  • Figure 4: F1-score self-reported by the algorithm developers versus the event-based F1-score obtained in this challenge. The difference in F1-score is represented by the color of the line.