EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models

Wei Xiong, Jiangtong Li, Jie Li, Kun Zhu, Changjun Jiang

TL;DR

EEG-FM-Bench tackles fragmented evaluation of EEG foundation models by offering a standardized, open-source benchmarking framework that spans 14 datasets across 10 EEG paradigms and supports multiple fine-tuning regimes, task organizations, and decoder heads. It pairs standardized data processing with rich diagnostic tools (gradient-space, subspace, CKA, RSA) to uncover why pre-training and transfer succeed or fail. Key findings show a generalization gap in frozen-backbone settings, multi-task learning as a robust regularizer, and a departure from conventional scaling laws where compact, domain-specific architectures outperform larger models. The benchmark enables reproducible comparisons and actionable guidance for designing efficient EEG-FMs, highlighting the importance of pre-training objectives, brain-informed architectures, and multi-task strategies. This work promotes interpretable, scalable progress toward practical EEG representation learning.
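
To illustrate the representation-similarity diagnostics mentioned above, here is a minimal sketch of linear CKA in its standard feature-space form. It is not taken from the EEG-FM-Bench code base: the function name `linear_cka`, the array shapes, and the random toy activations are all assumptions made for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: (n_samples, d1) and y: (n_samples, d2), rows aligned to the same inputs.
    Hypothetical helper for illustration, not the benchmark's own API.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy activations standing in for two layers evaluated on the same 64 inputs.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(64, 16))
layer_b = rng.normal(size=(64, 32))
print(linear_cka(layer_a, layer_a))
print(linear_cka(layer_a, layer_b))
```

The score lies in [0, 1] and is unchanged by orthogonal transformations and isotropic scaling of either representation, which is why it is commonly used to compare layers of different widths.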

Abstract

Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while a lack of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behaviors. To address this, we introduce EEG-FM-Bench, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning acts as a critical regularizer to mitigate overfitting in data-scarce EEG contexts; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) model scaling deviates from typical laws, as compact architectures with domain-specific inductive biases consistently outperform significantly larger models. This benchmark enables fair comparison and reproducible analysis, shifting the field from fragmented results to interpretable advances. Code is available at https://github.com/xw1216/EEG-FM-Bench.

Paper Structure

This paper contains 41 sections, 5 equations, 30 figures, 9 tables.

Figures (30)

  • Figure 1: Scaling analysis: Performance vs. Data and Model Size. We plot the overall balanced accuracy against pre-training data size. Bubble size represents the model parameter count.
  • Figure 2: Overview of the EEG-FM-Bench framework. Data Management: A unified pipeline standardizes preprocessing across 14 datasets covering 10 canonical EEG paradigms. Configurable Evaluation: A configuration-based system facilitates flexible assessments by decoupling model backbones from decoder heads, supporting diverse fine-tuning strategies (e.g., frozen-backbone, LoRA, multi-task). Diagnostic Analyses: The platform provides both qualitative visualizations (e.g., decision basis, feature discrimination) and quantitative metrics to examine model performance and optimization dynamics.
  • Figure 3: Cross-task gradient correlation analysis. The heatmaps visualize the pairwise cosine similarity of gradients between different tasks across specific model components, revealing how tasks interact during optimization. Best viewed via zoom-in.
  • Figure 4: Cross-task subspace affinity analysis. This metric quantifies the geometric alignment of optimization subspaces among tasks, indicating the extent to which different tasks share a common optimization direction within each component. Best viewed via zoom-in.
  • Figure 5: Performance comparison between single-task and multi-task fine-tuning on 14 downstream tasks. Best viewed via zoom-in.
  • ...and 25 more figures
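
The pairwise gradient cosine similarity described in the Figure 3 caption can be sketched as follows. This is a minimal illustration rather than the benchmark's implementation: the function name `gradient_cosine_matrix`, the placeholder task names, and the toy gradient vectors are assumptions. In practice each vector would be the flattened gradient of one task's loss with respect to a single model component.

```python
import numpy as np


def gradient_cosine_matrix(task_grads):
    """Pairwise cosine similarity between per-task gradient vectors.

    task_grads maps a task name to the gradient of that task's loss for one
    model component (any shape; it is flattened here).
    Hypothetical helper for illustration only.
    """
    names = list(task_grads)
    grads = np.stack(
        [np.ravel(np.asarray(task_grads[name], dtype=float)) for name in names]
    )
    # Normalize each gradient to unit length; the clip guards against
    # division by zero when a gradient is all zeros.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)
    return names, unit @ unit.T


# Toy gradients: task_a and task_c conflict, task_b is orthogonal to both.
toy_grads = {
    "task_a": [1.0, 0.0],
    "task_b": [0.0, 2.0],
    "task_c": [-3.0, 0.0],
}
names, similarity = gradient_cosine_matrix(toy_grads)
print(names)
print(similarity)
```

Entries near 1 indicate tasks that push a component in the same direction, entries near 0 indicate largely independent updates, and negative entries indicate conflicting gradients.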