From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Spatiotemporal Dynamics in Brain Signal Analysis

Amirabbas Hojjati, Lu Li, Ibrahim Hameed, Anis Yazidi, Pedro G. Lind, Rabindra Khadka

Abstract

EEG signals capture brain activity with high temporal and low spatial resolution, supporting applications such as neurological diagnosis, cognitive monitoring, and brain-computer interfaces. However, effective analysis is hindered by limited labeled data, high dimensionality, and the absence of scalable models that fully capture spatiotemporal dependencies. Existing self-supervised learning (SSL) methods often focus on either spatial or temporal features, leading to suboptimal representations. To this end, we propose EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA learns semantically meaningful spatiotemporal representations using joint embeddings and adaptive masking. To our knowledge, this is the first work that exploits V-JEPA for EEG classification and explores the visual concepts learned by the model. Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy. Beyond classification accuracy, EEG-VJEPA captures physiologically relevant spatial and temporal signal patterns, offering interpretable embeddings that may support human-AI collaboration in diagnostic workflows. These findings position EEG-VJEPA as a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings.

Paper Structure

This paper contains 9 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: EEG-VJEPA. EEG signals are transformed into 3D shapes using a sliding window technique. The input passes through 3D convolution to produce patch embeddings with spatiotemporal features. The sequence of flattened tokens is masked before feeding into the X-encoder. Learnable masked tokens with positional embedding are added to the output of the X-encoder. The predictor network outputs embedding vectors for each mask token. The Y-encoder processes the input signal without masking to output the ground truth for the predictor. L1 loss is applied, as recommended in bardes2024revisiting for stability, minimizing the distance between the output of the predictor and the Y-encoder. The parameters of the Y-encoder are updated using the exponential moving average (EMA) of the X-encoder weights.
  • Figure 2: The pre-training loss of … from Table \ref{tab:hyperparameters}.
  • Figure 3: Inference. The pre-processed EEG signals are input into the pre-trained EEG-VJEPA model. The query token representations from the pre-trained encoder pass through a cross-attention layer, with its output added to the query tokens via a residual connection. Finally, this combined output is fed into a linear classifier.
  • Figure 4: UMAP-based 2D visualization of feature embeddings of EEG-VJEPA. (a) Age-related clusters appear in the embeddings. (b) A global structure emerges in the embeddings, grouped by pathological labels (Abnormal/Normal). (c) The gender-related global structure in the embeddings.
  • Figure 5: EEG-VJEPA locates regions of interest using spatiotemporal tokens. We roll out attention weights through all layers to visualize the attention along the EEG channel and time dimensions in 2D. The corresponding PSD plots show how EEG signal power is distributed across frequency bands, highlighting differences between normal (left side) and abnormal (right side) samples, across five different subjects in each class.
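Three of the steps named in the Figure 1 caption (the sliding-window transform, the L1 loss between predictor and Y-encoder outputs, and the EMA update of the Y-encoder) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 19-channel input, the window and stride sizes, and the choice to average the loss over masked positions only are all hypothetical.

```python
import numpy as np


def sliding_windows(eeg, window, stride):
    """Cut a (channels, samples) recording into overlapping windows.

    Returns an array of shape (frames, channels, window): a video-like
    stack in which each "frame" is a channels x time slice.
    """
    channels, samples = eeg.shape
    starts = range(0, samples - window + 1, stride)
    return np.stack([eeg[:, s:s + window] for s in starts])


def masked_l1_loss(predicted, target, mask):
    """Mean absolute error over masked token positions only (assumption).

    predicted, target: (tokens, dim) arrays; mask: boolean (tokens,),
    where True marks a masked token.
    """
    return float(np.abs(predicted[mask] - target[mask]).mean())


def ema_update(target_params, online_params, momentum):
    """Move each Y-encoder weight toward the matching X-encoder weight."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]


# Hypothetical sizes: 19 channels, 1000 samples, 200-sample windows.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((19, 1000))
frames = sliding_windows(eeg, window=200, stride=100)
print(frames.shape)
```

In a full training loop, the 3D convolution, the X-encoder, and the predictor would sit between `sliding_windows` and `masked_l1_loss`. `ema_update` would run after each optimizer step, so the Y-encoder receives no gradients of its own.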