Distributed Harmonization: Federated Clustered Batch Effect Adjustment and Generalization

Bao Hoang, Yijiang Pang, Siqi Liang, Liang Zhan, Paul Thompson, Jiayu Zhou

TL;DR

Multi-site medical imaging data suffer from site effects that bias measurements and violate i.i.d. assumptions, and standard ComBat cannot generalize to unseen sites without retraining. Cluster ComBat and Distributed Cluster ComBat cluster site patterns and share harmonization parameters across clusters, enabling generalization to new sites and privacy-preserving federated harmonization. Validation on synthetic data and the ADNI brain-imaging dataset shows improved downstream regression performance and robust feature selection compared with baselines. The approach supports scalable, privacy-conscious large-scale multi-site analyses (e.g., ENIGMA) by reducing retraining overhead and facilitating incorporation of new sites.

Abstract

Many data analysis and modeling techniques assume independent and identically distributed (i.i.d.) data. In the medical domain, data are naturally decentralized, so collecting them from multiple sites or institutions is a common strategy for obtaining sufficient clinical diversity. However, data from different sites are easily biased by the local environment or facilities, which violates the i.i.d. assumption. A common remedy is to harmonize the data, removing site bias while retaining important biological information. ComBat is among the most popular harmonization approaches and has recently been extended to handle distributed sites. However, when new sites join training, or when data from unknown or unseen sites must be evaluated, ComBat cannot be applied directly and requires retraining with data from all sites. This retraining incurs computational and logistical overhead that is usually prohibitive. In this work, we develop a novel Cluster ComBat harmonization algorithm, which leverages cluster patterns of the data across sites and greatly improves the usability of ComBat harmonization. We use extensive simulation and real medical imaging data from ADNI to demonstrate the superiority of the proposed approach. Our code is available at https://github.com/illidanlab/distributed-cluster-harmonization.
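
To make the idea of cluster-level harmonization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain location-scale adjustment (no biological covariates and no empirical Bayes shrinkage, both of which ComBat uses), describes each site only by its per-feature means and standard deviations, and groups sites with a small hand-written k-means. All function names (`site_summary`, `fit_cluster_harmonizer`, `harmonize_unseen_site`) and the synthetic data are hypothetical and chosen for illustration only.

```python
import numpy as np

def site_summary(X):
    """Summarize one site by its per-feature means and standard deviations."""
    return np.concatenate([X.mean(axis=0), X.std(axis=0)])

def nearest(points, centers):
    """Index of the closest center (squared Euclidean distance) for each point."""
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = nearest(points, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, nearest(points, centers)

def fit_cluster_harmonizer(site_data, k):
    """Cluster sites by their summaries and estimate one (mean, std) per cluster."""
    summaries = np.stack([site_summary(X) for X in site_data])
    centers, labels = kmeans(summaries, k)
    pooled = np.vstack(site_data)
    params = {}
    for j in np.unique(labels):
        Xc = np.vstack([X for X, l in zip(site_data, labels) if l == j])
        params[int(j)] = (Xc.mean(axis=0), Xc.std(axis=0))
    return {"centers": centers, "labels": labels, "params": params,
            "grand": (pooled.mean(axis=0), pooled.std(axis=0))}

def harmonize_unseen_site(X, model):
    """Assign a site to its nearest cluster and reuse that cluster's parameters."""
    ids = sorted(model["params"])
    j = ids[nearest(site_summary(X)[None, :], model["centers"][ids])[0]]
    mu, sd = model["params"][j]
    grand_mu, grand_sd = model["grand"]
    # Remove the cluster's location/scale, then map onto the pooled distribution.
    return (X - mu) / sd * grand_sd + grand_mu, j

# Synthetic example (assumed data): two latent groups of three sites each.
rng = np.random.default_rng(0)
train_sites = [rng.normal(0.0, 1.0, size=(500, 4)) for _ in range(3)] + \
              [rng.normal(5.0, 2.0, size=(500, 4)) for _ in range(3)]
model = fit_cluster_harmonizer(train_sites, k=2)

# A site never seen during fitting is harmonized without refitting anything.
new_site = rng.normal(5.0, 2.0, size=(500, 4))
harmonized, cluster = harmonize_unseen_site(new_site, model)
```

In this sketch the unseen site only needs its own summary statistics and the stored cluster parameters, which is the property the abstract attributes to clustering: no retraining over all sites when a new one arrives. A faithful version would additionally model biological covariates so that they are preserved during adjustment.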

Paper Structure

This paper contains 16 sections, 6 equations, 7 figures, 9 tables, and 4 algorithms.

Figures (7)

  • Figure 1: Graphical model used to generate synthetic data. The shaded circles represent observed variables, including biological covariates and feature values, while unshaded circles represent latent parameters.
  • Figure 2: Synthetic Data: site pattern and label pattern of the raw data.
  • Figure 3: Feature and parameter distribution of synthetic data. Sites within the same cluster in the feature space (as shown in (a)) can also be clustered into the same cluster in the parameter space (as shown in (b)). This indicates that cluster patterns in the feature space can be retained in the parameter space.
  • Figure 4: Synthetic Data: site pattern (panels for raw data, ComBat, and Cluster ComBat) and label pattern (panels for raw data, ComBat, and Cluster ComBat) after harmonization.
  • Figure 5: LDA plot of brain imaging data by cluster index
  • ...and 2 more figures