EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models
Wei Xiong, Jiangtong Li, Jie Li, Kun Zhu, Changjun Jiang
TL;DR
EEG-FM-Bench tackles fragmented evaluation of EEG foundation models by offering a standardized, open-source benchmarking framework that spans 14 datasets across 10 EEG paradigms and supports multiple fine-tuning regimes, task organizations, and decoder heads. It pairs standardized data processing with rich diagnostic tools (gradient-space, subspace, CKA, RSA) to uncover why pre-training and transfer succeed or fail. Key findings show a generalization gap in frozen-backbone settings, multi-task learning as a robust regularizer, and a departure from conventional scaling laws where compact, domain-specific architectures outperform larger models. The benchmark enables reproducible comparisons and actionable guidance for designing efficient EEG-FMs, highlighting the importance of pre-training objectives, brain-informed architectures, and multi-task strategies. This work promotes interpretable, scalable progress toward practical EEG representation learning.
Abstract
Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while a lack of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behaviors. To address this, we introduce EEG-FM-Bench, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning acts as a critical regularizer to mitigate overfitting in data-scarce EEG contexts; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) model scaling deviates from typical laws, as compact architectures with domain-specific inductive biases consistently outperform significantly larger models. This benchmark enables fair comparison and reproducible analysis, shifting the field from fragmented results to interpretable advances. Code is available at https://github.com/xw1216/EEG-FM-Bench.
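As an illustration of what a gradient conflict between a reconstruction objective and a downstream task can look like, the sketch below computes the cosine similarity between two gradient vectors; a negative value means that a step that lowers one loss tends to raise the other. This is a generic diagnostic written for this summary, not the benchmark's own implementation: the function name `gradient_cosine` and the toy gradient values are assumptions.

```python
import numpy as np

def gradient_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity between two gradients, flattened to vectors.

    Hypothetical helper for illustration; returns 0.0 if either gradient is zero.
    """
    g_a = np.ravel(np.asarray(g_a, dtype=float))
    g_b = np.ravel(np.asarray(g_b, dtype=float))
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    return float(g_a @ g_b / max(denom, eps))

# Toy gradients (made-up values) of two losses w.r.t. the same shared parameters.
g_reconstruction = [1.0, 0.0]
g_downstream = [-1.0, 1.0]

cos = gradient_cosine(g_reconstruction, g_downstream)
print(f"cosine = {cos:.4f}, conflict = {cos < 0}")
```

In practice the two gradients would be taken with respect to the shared backbone parameters on the same batch, then flattened and concatenated across layers before the comparison.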
