Standing on the Shoulders of Giants: Rethinking EEG Foundation Model Pretraining via Multi-Teacher Distillation

Chenqi Li, Yu Liu, Shuo Zhang, Timothy Denison, Tingting Zhu

TL;DR

This work proposes the Multi-Teacher Distillation Pretraining (MTDP) framework, which pretrains EEG foundation models via two-stage multi-teacher distillation, and demonstrates that mainstream foundation models, such as those from vision and time series, transfer surprisingly well to the EEG domain.

Abstract

Pretraining for electroencephalogram (EEG) foundation models has predominantly relied on self-supervised masked reconstruction, a paradigm largely adapted from and inspired by the success of vision and language foundation models. However, unlike images and text, EEG datasets are notoriously expensive to collect and are characterized by a low signal-to-noise ratio. These challenges make it difficult to scale EEG foundation models and to capture the underlying neural semantics through reconstruction. In this work, we ask the question: can we stand on the shoulders of well-established foundation models from well-represented modalities to bootstrap the pretraining of EEG foundation models? We first demonstrate that mainstream foundation models, such as those from vision and time series, transfer surprisingly well to the EEG domain. Motivated by this finding, we propose the Multi-Teacher Distillation Pretraining (MTDP) framework for pretraining EEG foundation models via two-stage multi-teacher distillation. In the first stage, we introduce a learnable gating network to fuse representations from diverse teachers (e.g., DINOv3 and Chronos) via a masked latent denoising objective. In the second stage, we distill the fused representation into an EEG foundation model. Extensive evaluations across 9 downstream tasks and 12 datasets demonstrate that our MTDP-based EEG foundation model outperforms its self-supervised counterparts while requiring only 25% of the pretraining data.
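
The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes that each teacher's representation has already been projected to a common dimension D, that the gate is a single linear layer followed by a softmax over teachers, and that the Stage 2 distillation loss is a mean squared error. The abstract does not specify any of these details. The function names (gated_fusion, distillation_loss) are hypothetical, and the masked latent denoising objective used to train the gate is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(teacher_reps, gate_w, gate_b):
    """Stage 1 forward pass (assumed form): fuse K frozen teacher outputs.

    teacher_reps: (K, N, D) array, K teachers, N patches, D features
                  (assumes all teachers were projected to the same D).
    gate_w:       (K * D, K) weights of a hypothetical linear gate.
    gate_b:       (K,) bias of the gate.
    Returns the fused representation (N, D) and per-patch weights (N, K).
    """
    K, N, D = teacher_reps.shape
    # Concatenate the teachers' features for every patch: (N, K * D).
    concat = teacher_reps.transpose(1, 0, 2).reshape(N, K * D)
    # One softmax weight per teacher and patch.
    weights = softmax(concat @ gate_w + gate_b, axis=-1)
    # Weighted sum over the teacher axis.
    fused = np.einsum("nk,knd->nd", weights, teacher_reps)
    return fused, weights

def distillation_loss(student_reps, fused_reps):
    """Stage 2 loss (assumed to be MSE): align student with fused target."""
    return float(np.mean((student_reps - fused_reps) ** 2))

# Toy example with random stand-ins for teacher and student outputs.
rng = np.random.default_rng(0)
K, N, D = 2, 8, 16
teacher_reps = rng.standard_normal((K, N, D))
gate_w = np.zeros((K * D, K))   # an untrained gate weights teachers equally
gate_b = np.zeros(K)
fused, weights = gated_fusion(teacher_reps, gate_w, gate_b)
student_reps = rng.standard_normal((N, D))
loss = distillation_loss(student_reps, fused)
```

With a zero-initialized gate, the fused representation reduces to the plain average of the teachers; training the gate would let the weights vary per patch. In Stage 2 the fused target is held fixed and only the student is updated.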

Paper Structure (29 sections, 9 equations, 5 figures, 7 tables, 1 algorithm)


Figures (5)

  • Figure 1: Comparison of EEG Foundation Model Pretraining. a) Conventional self-supervised pretraining, in which the EEG foundation model reconstructs missing patches in the temporal, frequency, or latent domain. b) The proposed framework, which bootstraps EEG foundation model pretraining by standing on the shoulders of well-established foundation models from well-represented modalities.
  • Figure 2: Linear probing performance of CBraMod and DINOv3 on EEG downstream tasks. Balanced Accuracy (%).
  • Figure 3: Overview of Two-Stage Multi-Teacher Distillation Pretraining (MTDP). Stage 1: Teacher Representation Fusion. A learnable gating network is introduced to weigh and fuse representations from frozen teacher models. The gate is trained via a masked latent denoising objective. Stage 2: Knowledge Distillation. The fused teacher representation acts as the target to pretrain the student EEG foundation model. The distillation loss is minimized to align the student representations with the fused representations.
  • Figure 4: Linear probing performance of CBraMod and CBraMod-MTDP on EEG downstream tasks. Balanced Accuracy (%).
  • Figure 5: Loss curves of Stage 1 and Stage 2 pretraining.
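
Figures 2 and 4 report balanced accuracy (%). As a reminder of what that metric measures, here is a small sketch using the standard definition (the mean of per-class recall, scaled to a percentage). The function name and toy labels are illustrative only and are not taken from the paper.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, as a percentage."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    # Recall is computed only for classes that occur in y_true.
    recalls = [correct[c] / total[c] for c in total]
    return 100.0 * sum(recalls) / len(recalls)

# Imbalanced toy labels: a classifier that always predicts class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
plain_acc = 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 80% while balanced accuracy is 50%, which is why balanced accuracy is a common choice when class frequencies are uneven.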