The largest EEG-based BCI reproducibility study for open science: the MOABB benchmark

Sylvain Chevallier; Igor Carrara; Bruno Aristimunha; Pierre Guetschel; Sara Sedlar; Bruna Lopes; Sebastien Velut; Salim Khazem; Thomas Moreau

TL;DR

The paper tackles the reproducibility gap in EEG-based BCIs by conducting the largest open, reproducible benchmark across 36 datasets and 30 pipelines (MI, P300, SSVEP) within a unified MOABB framework. It shows that Riemannian geometry-based classifiers—especially tangent-space variants—consistently outperform Raw and Deep Learning pipelines, while deep learning requires substantial trial counts for competitive performance. The study also integrates environmental impact assessment via Code Carbon and provides a transparent, open-access platform for ongoing benchmarking and cross-dataset comparisons. Collectively, these contributions advance rigor, transparency, and scalability in BCI research, enabling robust cross-study comparisons and guiding practical experimental design.

Abstract

Objective. This study conducts an extensive brain-computer interface (BCI) reproducibility analysis on open electroencephalography datasets, aiming to assess existing solutions and establish open and reproducible benchmarks for effective comparison within the field. The need for such a benchmark lies in the rapid industrial progress that has given rise to undisclosed proprietary solutions. Furthermore, the scientific literature is dense and often features evaluations that are hard to reproduce, which makes comparisons between existing approaches arduous. Approach. Within an open framework, 30 machine learning pipelines (separated into raw signal: 11, Riemannian: 13, deep learning: 6) are meticulously re-implemented and evaluated across 36 publicly available datasets, including motor imagery (14), P300 (15), and SSVEP (7). The analysis uses statistical meta-analysis techniques to assess the results and also covers execution time and environmental impact. Main results. The study yields principled and robust results applicable to various BCI paradigms, with an emphasis on motor imagery, P300, and SSVEP. Notably, Riemannian approaches that use spatial covariance matrices exhibit superior performance, while deep learning techniques need significant data volumes to achieve competitive outcomes. The comprehensive results are openly accessible, paving the way for future research to further enhance reproducibility in the BCI domain. Significance. This study contributes a rigorous and transparent benchmark for BCI research, offering insights into optimal methodologies and highlighting the importance of reproducibility in driving advancements within the field.
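
As a reading aid for the Riemannian pipelines mentioned above, the tangent-space mapping commonly used in the Riemannian BCI literature can be written as below. This is the standard textbook formulation and is not notation taken from this page; the symbols C_i, \bar{C}, S_i and s_i are chosen here for illustration, and the individual pipelines in the benchmark may differ in how they estimate covariances or the reference point.

```latex
% Assumed notation (not from the source page):
%   C_i      spatial covariance matrix of trial i (n x n, symmetric positive definite)
%   \bar{C}  reference point, typically the Riemannian mean of the training covariances
%   \log     matrix logarithm, \|\cdot\|_F Frobenius norm

% Affine-invariant Riemannian distance between two covariance matrices
\delta_R(C_1, C_2) = \left\| \log\!\left( C_1^{-1/2}\, C_2\, C_1^{-1/2} \right) \right\|_F

% Projection of trial i onto the tangent space at \bar{C}
S_i = \log\!\left( \bar{C}^{-1/2}\, C_i\, \bar{C}^{-1/2} \right)

% Vectorization: keep the upper triangle, weighting off-diagonal terms by \sqrt{2}
% so that \|s_i\|_2 = \|S_i\|_F; the result has n(n+1)/2 entries
s_i = \operatorname{upper}(S_i) \in \mathbb{R}^{n(n+1)/2}
```

The vectors s_i live in a Euclidean space, so they can be passed to an ordinary linear classifier, which is the general idea behind the tangent-space variants named in the TL;DR.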

Paper Structure (42 sections, 7 equations, 15 figures, 10 tables)

Introduction
Open data
BCI pipelines
Evaluation and Reproducibility in BCI
Environmental impact
Contributions
Benchmark methodology
Analysis pipelines inclusion
Evaluation method
Grid search
Statistical analysis
Code Carbon
Datasets
Motor Imagery
P300/ERP
...and 27 more sections

Figures (15)

Figure 1: Within-session evaluation. Small rectangles indicate a sample or trial, pastel colors on the two top lines show the chronological order, and bright colors on the last three lines indicate training and testing samples/trials.
Figure 2: Visualization of the MOABB datasets, with MI in green, ERP in pink/purple and SSVEP in yellow/brown. The size of the circle is proportional to the number of subjects and the contrast depends on the number of electrodes.
Figure 3: Average performance of pipelines grouped by category (Deep Learning, Riemannian, and Raw) across the MI (right-hand vs left-hand), ERP, and SSVEP paradigms displayed as raincloud plots. Each point in the plot corresponds to the average score of one dataset across all pipelines within a specific category, encompassing all subjects and sessions.
Figure 4: (a) ROC-AUC scores are averaged across all sessions, subjects, and datasets within the right-hand vs left-hand paradigm for each category (Deep Learning, Riemannian, Raw), segmented by the number of channels on the y-axis. Box plots overlaid with strip plots show individual ROC-AUC scores. (b) Distribution of scores for the Riemannian pipelines in the right-hand vs left-hand classification task. The boxes and horizontal black bars denote quartile ranges.
Figure 5: (a) Distributions of scores averaged over all datasets for the right-hand vs left-hand classification task within the pipelines. (b) Scores averaged across all sessions, subjects, and datasets within the right-hand vs feet paradigm for the pipelines, segmented by the number of epochs on the y-axis.
...and 10 more figures
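
The within-session evaluation described in the Figure 1 caption can be sketched as below: trials are grouped by session, shuffled so that chronological order is broken, and cut into disjoint test folds, so training and testing never cross a session boundary. This is a minimal illustration under stated assumptions, not MOABB's implementation: the function name `within_session_splits`, the `n_folds=5` default, and the fixed seed are all hypothetical, and the sketch uses plain shuffled k-fold without class stratification.

```python
import random
from collections import defaultdict


def within_session_splits(trials, n_folds=5, seed=42):
    """Yield (subject, session, fold, train_idx, test_idx) tuples.

    `trials` is a list of (subject, session) pairs, one per trial, in
    chronological order. Hypothetical helper for illustration only.
    """
    # Group trial indices by (subject, session).
    groups = defaultdict(list)
    for idx, key in enumerate(trials):
        groups[key].append(idx)

    rng = random.Random(seed)
    for (subject, session), indices in sorted(groups.items()):
        # Shuffle within the session to break chronological order.
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Deal the shuffled indices into n_folds disjoint folds.
        folds = [shuffled[k::n_folds] for k in range(n_folds)]
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield subject, session, k, sorted(train_idx), sorted(test_idx)
```

Each yielded split would then be used to fit a pipeline on `train_idx` and score it on `test_idx`, with the per-fold scores averaged per session. A stratified variant that preserves class balance in each fold would be a natural refinement.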
