S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention

Pierre Guetschel; Thomas Moreau; Michael Tangermann

TL;DR

Signal-JEPA, a Joint Embedding Predictive Architecture for representing EEG recordings, is introduced together with a novel domain-specific spatial block masking strategy and three novel architectures for downstream classification. The results highlight the importance of spatial filtering for accurate downstream classification.

Abstract

Motivated by the challenge of seamless cross-dataset transfer in EEG signal processing, this article presents an exploratory study on the use of Joint Embedding Predictive Architectures (JEPAs). In recent years, self-supervised learning has emerged as a promising approach for transfer learning in various domains. However, its application to EEG signals remains largely unexplored. In this article, we introduce Signal-JEPA, a framework for representing EEG recordings that includes a novel domain-specific spatial block masking strategy and three novel architectures for downstream classification. The study is conducted on a 54-subject dataset, and the downstream performance of the models is evaluated on three different BCI paradigms: motor imagery, ERP and SSVEP. Our study provides preliminary evidence for the potential of JEPAs in EEG signal encoding. Notably, our results highlight the importance of spatial filtering for accurate downstream classification and reveal that the length of the pre-training examples influences downstream performance, whereas the mask size does not.

Figures (6)

Figure 1: S-JEPA training procedure. The framework takes as input EEG recordings with $C$ channels and $T$ time samples, and binary masks of length $L$. First, the Local encoder independently transforms $t$ windows from each channel into $C \times t = L$ embedding vectors, called tokens, of dimensionality $d$. Then, the tokens are tagged with their originating channel and temporal position and flattened into a sequence of length $L$. Subsequently, only the unmasked tokens are passed to the Contextual encoder, while the full token sequence is given to the Contextual target encoder to generate training targets. Finally, the Predictor attempts to reconstruct the masked tokens, and its predictions are compared with the targets using an L1 loss. During optimisation, the parameters of the Contextual target encoder are not trained via gradient backpropagation but track those of the Contextual encoder through an Exponential Moving Average (EMA). Figure inspired by Bardes et al. (2024).
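
The data flow described in the Figure 1 caption can be sketched in NumPy. This is a toy illustration, not the authors' implementation: the three modules are replaced by hypothetical random linear maps (`W_ctx`, `W_tgt`, `W_pred`), the channel and position tagging is omitted, the predictor is a trivial stand-in, and the EMA decay `tau` is an assumed value that the source does not give. Only the masking, the L1 loss on masked tokens and the EMA update follow the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

C, t, d = 4, 5, 8          # channels, windows per channel, embedding size (toy values)
L = C * t                  # sequence length after flattening

# Stand-in for the Local encoder output: one d-dimensional token per (channel, window).
tokens = rng.normal(size=(C, t, d)).reshape(L, d)

# Binary mask of length L: True marks tokens hidden from the Contextual encoder.
mask = np.zeros(L, dtype=bool)
mask[: 2 * t] = True       # hide every window of the first two channels

# Hypothetical linear stand-ins for the encoders and the predictor.
W_ctx = rng.normal(size=(d, d))
W_tgt = W_ctx.copy()       # target encoder starts as a copy of the context encoder
W_pred = rng.normal(size=(d, d))

def l1_loss_on_masked(tokens, mask, W_ctx, W_tgt, W_pred):
    """L1 distance between predictions and targets at the masked positions."""
    context = tokens[~mask] @ W_ctx          # only unmasked tokens are encoded
    targets = (tokens @ W_tgt)[mask]         # full sequence goes to the target encoder
    # Toy predictor: predict every masked token from the mean context embedding.
    predictions = np.tile(context.mean(axis=0) @ W_pred, (int(mask.sum()), 1))
    return np.abs(predictions - targets).mean()

def ema_update(W_tgt, W_ctx, tau=0.99):
    """Target weights follow the context weights by exponential moving average."""
    return tau * W_tgt + (1.0 - tau) * W_ctx

loss = l1_loss_on_masked(tokens, mask, W_ctx, W_tgt, W_pred)
W_ctx_new = W_ctx + 0.1    # pretend a gradient step changed the context encoder
W_tgt_new = ema_update(W_tgt, W_ctx_new)
```

The target weights move only a fraction `1 - tau` of the way towards the updated context weights, which is what keeps the training targets slowly varying.
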
Figure 2: Visualisation of the spatial block masking strategy for three example mask centres (red electrodes). The dark to light green spheres represent masks with diameters of 40 %, 60 % and 80 % of the head size, as used in our experiments. Assuming a top-down view of the scalp, the depth of each electrode is indicated by its intensity (black: close, grey: distant). For a given mask, all electrodes within the corresponding sphere are hidden from the contextual encoder and must be predicted by the predictor.
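
A minimal sketch of the sphere-based masking in the Figure 2 caption is given below. It assumes that electrode positions are available as 3D coordinates, that distances are Euclidean, and that "head size" is a single scalar diameter. The function name `spatial_block_mask`, its parameters and the toy coordinates are all hypothetical and not taken from the source.

```python
import numpy as np

def spatial_block_mask(positions, centre_idx, diameter_fraction, head_size):
    """Return a boolean array: True for electrodes inside the mask sphere.

    positions: (n_electrodes, 3) array of 3D electrode coordinates.
    centre_idx: index of the electrode used as the mask centre.
    diameter_fraction: mask diameter as a fraction of head_size (e.g. 0.4).
    head_size: assumed scalar head diameter, in the same unit as positions.
    """
    radius = 0.5 * diameter_fraction * head_size
    distances = np.linalg.norm(positions - positions[centre_idx], axis=1)
    return distances <= radius

# Toy montage of 5 electrodes on a head of diameter 1.0 (coordinates are made up).
positions = np.array([
    [0.0, 0.0, 0.5],     # mask centre
    [0.15, 0.0, 0.45],
    [0.3, 0.0, 0.4],
    [0.0, 0.5, 0.0],
    [0.0, -0.5, 0.0],
])

masks = {
    fraction: spatial_block_mask(positions, centre_idx=0,
                                 diameter_fraction=fraction, head_size=1.0)
    for fraction in (0.4, 0.6, 0.8)
}
```

Every electrode marked `True` would be hidden from the contextual encoder. The centre electrode is always masked because its distance to itself is zero.
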
Figure 3: Downstream classification architectures. Each of the three alternative modifications of the pre-trained network adds two new layers. 1) Spatial aggregation is a convolutional layer that forms weighted combinations of the elements along the channel dimension, yielding $V\ll C$ "virtual" channels. 2) Fully-connected is a linear layer that predicts $c$ class probabilities.
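
The two added layers can be illustrated with plain NumPy. This sketch assumes the features have shape `(C, t, d)` and applies both layers directly to them. Where each of the three architectures places these layers within the pre-trained network is not covered here, and all sizes and weight names (`W_spatial`, `W_fc`, `b_fc`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C, t, d = 16, 4, 8      # channels, windows, embedding size (toy values)
V, n_classes = 2, 3     # virtual channels (V << C) and number of classes

features = rng.normal(size=(C, t, d))

# Spatial aggregation: each virtual channel is a weighted sum over the real channels.
W_spatial = rng.normal(size=(V, C))
virtual = np.einsum("vc,ctd->vtd", W_spatial, features)

# Fully-connected layer applied to the flattened virtual-channel features.
W_fc = rng.normal(size=(n_classes, V * t * d))
b_fc = np.zeros(n_classes)
logits = W_fc @ virtual.reshape(-1) + b_fc

# Softmax turns the logits into class probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
```

Reducing $C$ channels to $V$ virtual channels before the linear layer shrinks its input from `C * t * d` to `V * t * d` values, which is one practical motivation for a spatial filtering step.
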
Figure 4: Pre-training curves of the different configurations tested. The solid and dashed lines indicate the loss on the training and validation sets. While the validation loss was evaluated only once per epoch, the training loss was logged after every optimisation step. The training loss at individual optimisation steps is visible in the background; the corresponding epoch-wise averages are outlined in white. A star marks the lowest validation loss of each curve, which is the early-stopping time point and therefore the checkpoint from which all fine-tuning started.
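
As a small illustration of the quantities in the Figure 4 caption, the snippet below averages step-wise training losses per epoch and picks the epoch with the lowest validation loss as the checkpoint. All loss values, the number of epochs and the number of steps per epoch are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_epochs, steps_per_epoch = 6, 10

# Training loss is logged at every optimisation step (random placeholder values).
step_train_loss = rng.random(size=n_epochs * steps_per_epoch)

# Validation loss is evaluated once per epoch (made-up values).
val_loss = np.array([0.90, 0.70, 0.55, 0.60, 0.58, 0.65])

# Epoch-wise averages of the step-wise training loss.
epoch_train_loss = step_train_loss.reshape(n_epochs, steps_per_epoch).mean(axis=1)

# The starred point: epoch with the lowest validation loss (0-indexed).
best_epoch = int(np.argmin(val_loss))
```
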
Figure 5: Global downstream classification ranking of all combinations of pre-training configurations and fine-tuning schemes. Each of the three test datasets has 7 subjects and 5 folds per subject, for a total of 105 folds. In the legend, the combinations are ordered according to their average rank over all folds. The vertical span of a coloured "pixel" in the plot represents the number of folds in which the corresponding combination obtained the rank indicated by the x-axis.
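
The ranking summarised in the Figure 5 caption can be reproduced in outline as follows. The scores are random placeholders and the number of combinations (`n_configs`) is hypothetical. Only the fold count and the per-fold ranking logic follow the caption, and ties are ignored in this sketch.

```python
import numpy as np

n_datasets, n_subjects, n_folds_per_subject = 3, 7, 5
n_folds = n_datasets * n_subjects * n_folds_per_subject   # 3 * 7 * 5 folds

rng = np.random.default_rng(0)
n_configs = 4                                    # hypothetical number of combinations
scores = rng.random(size=(n_folds, n_configs))   # placeholder accuracies per fold

# Rank the combinations within each fold: rank 1 is the highest score.
ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1

# Legend order: combinations sorted by their average rank over all folds.
average_rank = ranks.mean(axis=0)
legend_order = np.argsort(average_rank)

# rank_counts[k, r - 1] is the number of folds in which combination k got rank r,
# i.e. the vertical span of the corresponding "pixel" in the plot.
rank_counts = np.stack(
    [(ranks == r).sum(axis=0) for r in range(1, n_configs + 1)], axis=1
)
```
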
...and 1 more figure