Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain

Jungwon Choi, Hyungi Lee, Byung-Hoon Kim, Juho Lee

TL;DR

This work tackles label-scarce learning for dynamic functional connectivity by pretraining a spatio-temporal masked autoencoder (ST-JEMA) on large unlabeled fMRI data. It adapts the Joint Embedding Predictive Architecture to graphs by reconstructing latent node and edge representations across space and time, using dual encoders with EMA updates and MLP-Mixer decoders. Across eight downstream rs-fMRI benchmarks, ST-JEMA consistently outperforms static and dynamic baselines on gender, age, and psychiatric diagnosis tasks, with particular strength in data-scarce clinical settings and in scenarios with temporal missing data. The approach demonstrates that leveraging high-level semantic reconstruction of dynamic graphs from unlabeled data yields robust, transfer-ready representations for neuroimaging phenotyping and diagnosis.
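The two mechanisms named above, an EMA-updated target encoder and reconstruction in latent space rather than input space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `ema_update` and `masked_latent_loss`, the momentum value, and the choice of mean squared error are assumptions made for illustration only.

```python
import numpy as np


def ema_update(target_params, context_params, momentum=0.99):
    """Return target parameters moved toward the context parameters.

    The momentum value is an illustrative assumption, not taken from the paper.
    """
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }


def masked_latent_loss(predicted, target, mask):
    """Mean squared error in latent space, averaged over masked nodes only.

    predicted, target: arrays of shape (num_nodes, latent_dim)
    mask: boolean array of shape (num_nodes,), True where a node was masked
    """
    per_node = ((predicted - target) ** 2).mean(axis=1)
    return float(per_node[mask].mean())


# Toy usage with hypothetical shapes: 4 nodes, 3 latent dimensions.
context = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
target = ema_update(target, context, momentum=0.9)

predicted_latents = np.zeros((4, 3))
target_latents = np.ones((4, 3))
node_mask = np.array([True, False, True, False])
loss = masked_latent_loss(predicted_latents, target_latents, node_mask)
```

In a setup of this kind, only the context encoder receives gradients from the loss, while the target encoder follows it through the EMA step, so the reconstruction targets change slowly during training.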

Abstract

Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks. However, obtaining extensive labeled clinical data for training is often resource-intensive, making practical application difficult. Leveraging unlabeled data thus becomes crucial for representation learning in a label-scarce setting. Although generative self-supervised learning techniques, especially masked autoencoders, have shown promising results in representation learning in various domains, their application to dynamic graphs for dynamic functional connectivity remains underexplored and faces challenges in capturing high-level semantic representations. Here, we introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision. ST-JEMA employs a JEPA-inspired strategy for reconstructing dynamic graphs, which enables the learning of higher-level semantic representations that account for temporal perspectives, addressing the challenges in fMRI data representation learning. Using the large-scale UK Biobank dataset for self-supervised learning, ST-JEMA learns strong representations of dynamic functional connectivity: it outperforms previous methods in predicting phenotypes and psychiatric diagnoses across eight benchmark fMRI datasets, even with limited samples, and its temporal reconstruction remains effective in missing-data scenarios. These findings highlight the potential of our approach as a robust representation learning method for leveraging label-scarce fMRI data.

Paper Structure (44 sections, 33 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Overview of the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA) framework. (a) Pre-training pipeline, where the model jointly reconstructs node representations and edge structures by masking blocks in both spatial and temporal dimensions. The context and target GNN encoders are updated via EMA. Spatial and temporal processes are indicated by green and orange dashed lines, respectively. (b) Fine-tuning pipeline for downstream tasks (classification or regression), where the pre-trained encoder outputs are pooled across spatial and temporal dimensions to generate task-specific representations.
  • Figure 2: Ablation study on the impact of limited labeled samples for the diagnosis classification task. We fine-tuned the pre-trained model using a limited number of samples, randomly sampled from the ABIDE dataset.
  • Figure 3: Overview of dynamic graph construction from data.
  • Figure 4: Ablation results for temporal missing-data scenarios on the psychiatric diagnosis classification task on the ABIDE dataset. We constructed these scenarios by randomly masking the signal along the time axis, varying the missing ratio from 30% to 90%, and then measured the score to evaluate how effectively the temporal dynamics are captured.
  • Figure 5: Ablation study on the impact of sample size on the ABIDE dataset. We evaluated the learned representations on downstream tasks by training a linear probing model across varying numbers of samples.
  • ...and 1 more figures
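
The temporal missing-data scenario described in the Figure 4 caption (random masking along the time axis at ratios from 30% to 90%) might be simulated roughly as follows. This is a sketch under stated assumptions, not the authors' protocol: the helper name `mask_timepoints`, zero-filling of masked time points, and the toy array shape are all hypothetical.

```python
import numpy as np


def mask_timepoints(signal, missing_ratio, rng):
    """Zero out a random subset of time points in a (time, regions) array.

    Zero-filling is an assumption; the caption only says the signal is
    randomly masked along the time axis.
    """
    n_time = signal.shape[0]
    n_missing = int(round(missing_ratio * n_time))
    missing_idx = np.sort(rng.choice(n_time, size=n_missing, replace=False))
    masked = signal.copy()
    masked[missing_idx, :] = 0.0
    return masked, missing_idx


rng = np.random.default_rng(0)
bold = rng.standard_normal((100, 8))  # hypothetical: 100 time points, 8 regions

# Sweep a few ratios inside the 30% to 90% range named in the caption.
for ratio in (0.3, 0.5, 0.7, 0.9):
    masked, idx = mask_timepoints(bold, ratio, rng)
    print(f"ratio={ratio:.1f} masked_timepoints={len(idx)}")
```

Each masked series would then be passed through the fine-tuned model, and the score compared across ratios.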