Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection

Jovana Kljajic, John M. O'Toole, Robert Hogan, Tamara Skoric

TL;DR

This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection, including claims of expert-level performance.

Abstract

Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert-level equivalence tests under varying class imbalance, inter-rater agreement, and number of raters. Matthews' and Pearson's correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and the level of agreement among them. Among human-expert-level equivalence tests, the multi-rater Turing test using Fleiss' $\kappa$ best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric, (2) sensitivity, specificity, PPV, and NPV, (3) multi-rater Turing test results using Fleiss' $\kappa$, and (4) all of the above on a held-out validation set. This proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.
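
The class-imbalance point in the abstract can be illustrated with simple arithmetic. The sketch below is not the authors' code; it assumes a hypothetical detector with fixed 90% sensitivity and 90% specificity and computes the expected confusion matrix at several seizure prevalences. Sensitivity and specificity (and hence the ROC operating point) do not depend on prevalence, whereas PPV and MCC do. The function name `metrics_at_prevalence` and the chosen prevalences are illustrative assumptions.

```python
import math


def metrics_at_prevalence(prevalence, sensitivity=0.9, specificity=0.9, n=100_000):
    """Expected metrics for a detector with fixed sensitivity/specificity.

    `prevalence` is the fraction of seizure epochs among `n` epochs.
    All names and default values here are illustrative assumptions.
    """
    pos = prevalence * n          # seizure epochs
    neg = n - pos                 # non-seizure epochs
    tp = sensitivity * pos
    fn = pos - tp
    tn = specificity * neg
    fp = neg - tn
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # Matthews correlation coefficient from the expected confusion matrix
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ppv": ppv, "npv": npv, "mcc": mcc, "fp_per_tp": fp / tp}


for prevalence in (0.5, 0.1, 0.01):
    m = metrics_at_prevalence(prevalence)
    print(
        f"prevalence={prevalence:<5} PPV={m['ppv']:.3f} NPV={m['npv']:.3f} "
        f"MCC={m['mcc']:.3f} FP/TP={m['fp_per_tp']:.2f}"
    )
```

With the same per-class error rate, PPV falls from 0.9 at balanced classes to 0.5 at 10% prevalence, which is the kind of degradation a prevalence-independent metric cannot show.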

Paper Structure

This paper contains 18 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overview of the Synthetic Annotation Framework (Methods A and B). Method A simulates multiple rater categories with different tendencies (well-calibrated raters with no shift, overraters with a positive shift, and underraters with a negative shift) by adding uniform shifts to the ground truth. Method B introduces predefined FP and FN rates by flipping selected probabilities in the ground truth, preserving the probabilistic structure while altering class labels.
  • Figure 2: Overview of different human-expert equivalence tests. The multi-rater agreement statistical Turing test replaces each rater with the AI to assess the impact on IRA ($\Delta \kappa_i=\kappa_{AI, Consensus}-\kappa_{raters}$), evaluating whether the AI can substitute for a human. If the 5th percentile of $\Delta \kappa$ is higher than the margin, the model is considered sufficiently reliable. The IRA vs. AI-Consensus Agreement test compares IRA among human raters ($\kappa_{raters}$) with AI-majority consensus agreement ($\kappa_{AI, Consensus}$), using bootstrapping to estimate $95\%$ CIs. In the Pairwise Metric Statistical Turing Test, each rater serves as the reference to compute pairwise metrics M (e.g., MCC) with the other raters and the AI. Differences between human-human scores are used to define non-inferiority margins (e.g., $MCC_{R1,R2} - MCC_{R1,R3}$), determining whether the AI performs within the range of human variability.
  • Figure 3: Influence of class imbalance on AUC and alternative performance metrics. A fixed $10\%$ error rate is used ($90\%$ sensitivity and specificity), so increasing imbalance raises the number of FPs relative to TPs and lowers PPV. AUC, in contrast to all other metrics, remains constant despite a significant drop in PPV and an increasing FP/TP ratio.
  • Figure 4: Impact of rater agreement and number of raters on consensus annotation. (a) Percentage of data excluded when unanimous consensus is required. (b) Average strength of the majority, in percent. Data points from two real-world datasets, the Helsinki dataset b17 and the Cork dataset b9, are included.
  • Figure A1: Effect of class imbalance and increasing disagreement on IRA metrics. The x-axis represents the percentage of TPs misclassified as FNs, with an equal number of FPs introduced to maintain the original class distribution. Cohen’s $\kappa$, Fleiss’ $\kappa$, and Krippendorff’s $\alpha$ yield identical values in this setting and are grouped as "Kappa metrics" (a). These retain interpretability across imbalances, while AC1 (b) collapses under high imbalance ($>10:1$), becoming insensitive to disagreement and producing inflated scores even when minority-class errors dominate.
  • ...and 5 more figures
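
The multi-rater Turing test described in the Figure 2 caption can be sketched roughly as follows. This is one possible reading of the caption, not the authors' implementation: Fleiss' $\kappa$ is computed for the human raters, each rater in turn is replaced by the AI, and the change in $\kappa$ is recorded. The bootstrap over epochs, the averaging of the per-rater changes, the margin of -0.05, and all function names are assumptions made for illustration only.

```python
import random


def fleiss_kappa(labels):
    """Fleiss' kappa for binary labels.

    `labels` is a list with one list of 0/1 annotations per rater.
    Undefined (division by zero) if every annotation is the same class.
    """
    n_raters = len(labels)
    n_items = len(labels[0])
    total_pos = 0
    agree_sum = 0.0
    for t in range(n_items):
        pos = sum(rater[t] for rater in labels)
        neg = n_raters - pos
        total_pos += pos
        # Proportion of agreeing rater pairs for this item
        agree_sum += (pos * pos + neg * neg - n_raters) / (n_raters * (n_raters - 1))
    p_bar = agree_sum / n_items
    p_pos = total_pos / (n_items * n_raters)
    p_e = p_pos ** 2 + (1 - p_pos) ** 2  # chance agreement
    return (p_bar - p_e) / (1 - p_e)


def delta_kappas(human_labels, ai_labels):
    """Change in Fleiss' kappa when the AI replaces each rater in turn."""
    base = fleiss_kappa(human_labels)
    deltas = []
    for i in range(len(human_labels)):
        swapped = human_labels[:i] + [ai_labels] + human_labels[i + 1:]
        deltas.append(fleiss_kappa(swapped) - base)
    return deltas


def multi_rater_turing_test(human_labels, ai_labels, margin=-0.05, n_boot=200, seed=0):
    """Bootstrap the mean delta-kappa over epochs (assumed scheme) and
    compare its 5th percentile with a hypothetical margin."""
    rng = random.Random(seed)
    n_items = len(ai_labels)
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        humans_b = [[rater[t] for t in idx] for rater in human_labels]
        ai_b = [ai_labels[t] for t in idx]
        d = delta_kappas(humans_b, ai_b)
        boot.append(sum(d) / len(d))
    boot.sort()
    p5 = boot[int(0.05 * (n_boot - 1))]  # simple empirical 5th percentile
    return p5, p5 > margin


# Toy data: three simulated raters and an AI, each flipping 10% of true labels.
rng = random.Random(1)
truth = [1 if rng.random() < 0.2 else 0 for _ in range(300)]


def noisy(labels, flip_prob):
    return [1 - y if rng.random() < flip_prob else y for y in labels]


humans = [noisy(truth, 0.1) for _ in range(3)]
ai = noisy(truth, 0.1)
p5, passed = multi_rater_turing_test(humans, ai)
print(f"5th percentile of delta-kappa: {p5:.3f}, passes margin: {passed}")
```

In practice the margin and the resampling unit (epochs, recordings, or infants) would need to be chosen to match the study design; the values above are placeholders.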