EEG-Bench: A Benchmark for EEG Foundation Models in Clinical Applications

Ard Kastrati, Josua Bürki, Jonas Lauer, Cheng Xuan, Raffaele Iaquinto, Roger Wattenhofer

TL;DR

To address the need for robust evaluation of EEG foundation models in clinical contexts, the paper introduces EEG-Bench, a benchmark spanning 14 public datasets and 11 diagnostic/event tasks with minimal preprocessing. It provides a standardized framework that supports both classical machine learning baselines and modern EEG foundation models (BENDR, Neuro-GPT, LaBraM), with cross-subject evaluation and handling of long recordings via chunked embeddings. Across experiments, LaBraM often leads or is competitive, but simple methods such as LDA and SVM excel in data-limited settings and on some tasks (epilepsy, sleep stages, schizophrenia), which highlights the foundation models' vulnerability to distribution shift. The work releases data and code for reproducibility, promotes generalizable EEG decoding, and identifies concrete directions for expanding benchmarks and models.
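
The two evaluation ideas named above, cross-subject splitting and chunked embeddings for long recordings, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy encoder, the non-overlapping chunking, and the mean pooling over chunk embeddings are all assumptions chosen for illustration.

```python
import random
from statistics import fmean


def cross_subject_split(subject_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no subject appears in both sets."""
    unique = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole subjects, always at least one.
    n_test = max(1, round(test_fraction * len(unique)))
    test_subjects = set(unique[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


def chunk_recording(recording, chunk_len):
    """Split a recording (list of channels, each a list of samples) into
    non-overlapping chunks of chunk_len samples, dropping any remainder."""
    n_chunks = len(recording[0]) // chunk_len
    if n_chunks == 0:
        raise ValueError("recording is shorter than one chunk")
    return [
        [channel[k * chunk_len:(k + 1) * chunk_len] for channel in recording]
        for k in range(n_chunks)
    ]


def toy_encoder(chunk):
    """Hypothetical stand-in for a foundation model: one mean per channel."""
    return [fmean(channel) for channel in chunk]


def embed_recording(recording, chunk_len, encoder=toy_encoder):
    """Embed each chunk, then mean-pool the chunk embeddings (an assumption)."""
    embeddings = [encoder(chunk) for chunk in chunk_recording(recording, chunk_len)]
    return [fmean(dim) for dim in zip(*embeddings)]


if __name__ == "__main__":
    subjects = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]
    train_idx, test_idx = cross_subject_split(subjects)
    print("held-out subjects:", sorted({subjects[i] for i in test_idx}))

    recording = [[float(t) for t in range(10)],
                 [float(t) for t in range(10, 20)]]
    print("recording embedding:", embed_recording(recording, chunk_len=4))
```

A classical baseline such as LDA or an SVM could then be trained on the pooled embeddings of the training subjects and scored only on the held-out subjects, so that the reported performance reflects generalization to unseen people.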

Abstract

We introduce a unified benchmarking framework focused on evaluating EEG-based foundation models in clinical applications. The benchmark spans 11 well-defined diagnostic tasks across 14 publicly available EEG datasets, covering conditions including epilepsy, schizophrenia, Parkinson's disease, OCD, and mild traumatic brain injury. It features minimal preprocessing and standardized evaluation protocols, and it enables side-by-side comparisons of classical baselines and modern foundation models. Our results show that while foundation models achieve strong performance in certain settings, simpler models often remain competitive, particularly under clinical distribution shifts. To facilitate reproducibility and adoption, we release all prepared data and code in an accessible and extensible format.

Paper Structure

This paper contains 34 sections, 1 figure, and 21 tables.

Figures (1)

  • Figure 1: Dataset diversity across three dimensions: (Left) Distribution of EEG hardware systems used across datasets, each with potentially different electrode layouts, channel counts, and sampling rates. (Right) Gender and age distribution of subjects, covering a broad range from infants (1 year old) to elderly adults (up to 80 years), reflecting the inclusion of both pediatric and geriatric populations. The age bars show mean and standard deviation.