A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds

Anton Orlichenko, Gang Qu, Ziyu Zhou, Anqi Liu, Hong-Wen Deng, Zhengming Ding, Julia M. Stephen, Tony W. Wilson, Vince D. Calhoun, Yu-Ping Wang

TL;DR

This work introduces DemoVAE, a demographics-conditioned variational autoencoder that decorrelates latent features of fMRI functional connectivity (FC) from subject demographics while enabling synthetic data generation conditioned on user-specified demographics. The model extends the VAE objective with multiple loss terms ($L_{Recon}$, $L_{Cov}$, $L_{Mean}$, $L_{Demo}$, and $L_{Guide}$) to enforce a diagonal latent covariance, zero-mean latents, and demographic-consistent sampling, thereby reducing demographic confounds in FC analyses. Trained and validated on the Philadelphia Neurodevelopmental Cohort (PNC) and BSNIP datasets, DemoVAE generates high-fidelity synthetic FC data that preserve group differences, together with latent representations that minimize demographic leakage, enabling safer data sharing and harmonization. Experimental results show that DemoVAE outperforms traditional VAEs and GANs in distributional matching, recapitulates known demographic-group FC differences, and yields latents that are uncorrelated with most of the clinical and demographic fields that correlate with FC, with a few remaining associations related to schizophrenia symptoms and medication. Overall, DemoVAE provides a practical framework for demographically controlled FC analysis and synthetic data generation, with implications for data dissemination and demographic-confound mitigation in neuroimaging.
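
To make the latent constraints named above more concrete, the following is a minimal NumPy sketch of batch-level penalties that push latents toward zero mean, a diagonal covariance, and zero cross-covariance with demographics. It is an illustration under stated assumptions, not the authors' implementation: the function name `latent_penalties`, the squared-error form of each penalty, and the use of a cross-covariance term are all hypothetical, and this summary does not specify how $L_{Cov}$, $L_{Mean}$, $L_{Demo}$, or $L_{Guide}$ are actually defined.

```python
import numpy as np


def latent_penalties(z, y):
    """Illustrative penalties on a batch of latents (hypothetical helper).

    z: array of shape (n_subjects, n_latent), encoder outputs.
    y: array of shape (n_subjects, n_demo), numerically coded demographics.
    Returns (cov_penalty, mean_penalty, demo_penalty) as floats.
    """
    n = z.shape[0]
    z_centered = z - z.mean(axis=0)
    y_centered = y - y.mean(axis=0)

    # Sample covariance of the latents; only off-diagonal entries are
    # penalized, which encourages a diagonal latent covariance.
    cov = z_centered.T @ z_centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_penalty = float(np.sum(off_diag ** 2))

    # Squared batch mean of each latent dimension (zero-mean latents).
    mean_penalty = float(np.sum(z.mean(axis=0) ** 2))

    # Cross-covariance between latents and demographics; driving it to zero
    # removes linear dependence of the latents on the demographics.
    cross = z_centered.T @ y_centered / (n - 1)
    demo_penalty = float(np.sum(cross ** 2))

    return cov_penalty, mean_penalty, demo_penalty


# Usage on random data: independent latents and demographics should give
# small (but not exactly zero) penalties because of sampling noise.
rng = np.random.default_rng(0)
z_batch = rng.standard_normal((256, 8))
y_batch = rng.standard_normal((256, 3))
penalties = latent_penalties(z_batch, y_batch)
```

In a training loop, terms like these would be weighted and added to the reconstruction loss; the weights themselves would be hyperparameters and are not given here.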

Abstract

Objective: fMRI and derived measures such as functional connectivity (FC) have been used to predict brain age, general fluid intelligence, psychiatric disease status, and preclinical neurodegenerative disease. However, it is not always clear that all demographic confounds, such as age, sex, and race, have been removed from fMRI data. Additionally, many fMRI datasets are restricted to authorized researchers, making dissemination of these valuable data sources challenging. Methods: We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics and generate high-quality synthetic fMRI data based on user-supplied demographics. We train and validate our model using two large, widely used datasets, the Philadelphia Neurodevelopmental Cohort (PNC) and Bipolar and Schizophrenia Network for Intermediate Phenotypes (BSNIP). Results: We find that DemoVAE recapitulates group differences in fMRI data while capturing the full breadth of individual variations. Significantly, we also find that most clinical and computerized battery fields that are correlated with fMRI data are not correlated with DemoVAE latents. The exceptions are several fields related to schizophrenia medication and symptom severity. Conclusion: Our model generates fMRI data that captures the full distribution of FC better than traditional VAE or GAN models. We also find that most prediction using fMRI data is dependent on correlation with, and prediction of, demographics. Significance: Our DemoVAE model allows for generation of high-quality synthetic data conditioned on subject demographics as well as the removal of the confounding effects of demographics. We identify that FC-based prediction tasks are highly influenced by demographic confounds.

Paper Structure (26 sections, 14 equations, 6 figures, 6 tables)


Figures (6)

  • Figure 1: Overview of the demographics-conditioned and decorrelated variational autoencoder (DemoVAE) model. Instead of reconstruction based only on latent features $\mathbf{z}=E_\phi(\mathbf{x})$, the DemoVAE model uses demographics $\mathbf{y}$ as input to the decoder $\hat{\mathbf{x}}=D_\theta(\mathbf{z},\mathbf{y})$. The two main uses of the model are inference, which generates latent features $\mathbf{z}$ decorrelated from demographics, and sampling, which generates synthetic fMRI data conditioned on user-provided demographics.
  • Figure 2: Histogram of standardized (age-corrected) WRAT scores from the PNC dataset, split between the two major race groups in the dataset. There is a clear demographic confound when predicting WRAT score from fMRI or genomic data. We show in Table \ref{tab:wrat-pred} that DemoVAE removes the effect of this confound but, in doing so, also removes the ability to accurately predict WRAT score.
  • Figure 3: Sampled FC matrices for real PNC resting state scans (top) compared to synthetic DemoVAE, VAE, and W-GAN FC data. Visually, all synthetic models generate convincing data.
  • Figure 4: t-SNE embeddings of synthetic FC data from DemoVAE, traditional VAE, and W-GAN models overlaid on t-SNE embeddings of real resting state FC data from the PNC dataset. Blue circles represent embeddings of real subject FC data, while orange crosses represent embeddings of synthetic data. We see that DemoVAE captures the distribution of fMRI FC data as well as or better than a traditional VAE and better than a GAN.
  • Figure 5: Group FC differences using real data and synthetic data generated by DemoVAE conditioned on appropriate demographic input. Top: synthetic DemoVAE data; bottom: real data. From left to right, we see that DemoVAE qualitatively recapitulates group differences in the PNC (mean, age, sex, race) and BSNIP (SZ diagnosis) datasets. Arrows point out FC features in real data that are reproduced in synthetic DemoVAE samples. Brain functional networks for the Power atlas, shown left to right and top to bottom in FC matrices, are given in Table \ref{tab:bfns}.
  • ...and 1 more figure
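
The sampling path described in the Figure 1 caption, $\hat{\mathbf{x}}=D_\theta(\mathbf{z},\mathbf{y})$ with $\mathbf{z}$ drawn from a standard normal prior and $\mathbf{y}$ supplied by the user, can be sketched as follows. This is a toy illustration only: the class `ToyConditionalDecoder`, its single linear layer with `tanh` output, the untrained random weights, the small ROI count, and the (age, sex) coding of `y` are all assumptions made for the example and do not reflect the architecture or training used in the paper.

```python
import numpy as np


def vec_to_fc(v, n_roi):
    """Rebuild a symmetric FC matrix with unit diagonal from its upper triangle."""
    fc = np.eye(n_roi)
    rows, cols = np.triu_indices(n_roi, k=1)
    fc[rows, cols] = v
    fc[cols, rows] = v
    return fc


class ToyConditionalDecoder:
    """Hypothetical stand-in for D_theta(z, y): one linear layer plus tanh."""

    def __init__(self, n_latent, n_demo, n_roi, seed=0):
        rng = np.random.default_rng(seed)
        self.n_latent = n_latent
        self.n_roi = n_roi
        n_edges = n_roi * (n_roi - 1) // 2  # unique off-diagonal FC entries
        # Random, untrained weights; a real model would learn these.
        self.W = rng.normal(scale=0.1, size=(n_latent + n_demo, n_edges))
        self.b = np.zeros(n_edges)

    def decode(self, z, y):
        # The decoder sees latents and demographics concatenated together.
        h = np.concatenate([z, y], axis=1)
        return np.tanh(h @ self.W + self.b)  # keeps values in [-1, 1]

    def sample(self, y, rng):
        # Sampling: draw z from N(0, I) and condition on user demographics.
        z = rng.standard_normal((y.shape[0], self.n_latent))
        return self.decode(z, y)


# Usage: two synthetic subjects with hypothetical (standardized age, sex) codes.
decoder = ToyConditionalDecoder(n_latent=4, n_demo=2, n_roi=5)
demographics = np.array([[-1.0, 0.0], [1.0, 1.0]])
edges = decoder.sample(demographics, np.random.default_rng(1))
fc_first = vec_to_fc(edges[0], n_roi=5)
```

Because the latent draw is independent of `y`, any demographic structure in the output has to enter through the decoder's conditioning input, which is the property that conditional sampling relies on.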