Distributed Harmonization: Federated Clustered Batch Effect Adjustment and Generalization
Bao Hoang, Yijiang Pang, Siqi Liang, Liang Zhan, Paul Thompson, Jiayu Zhou
TL;DR
Multi-site medical imaging data suffer from site effects that bias measurements and violate i.i.d. assumptions, and standard ComBat cannot generalize to unseen sites without retraining. Cluster ComBat and Distributed Cluster ComBat cluster site patterns and share harmonization parameters across clusters, enabling generalization to new sites and privacy-preserving federated harmonization. Validation on synthetic data and the ADNI brain-imaging dataset shows improved downstream regression performance and robust feature selection compared with baselines. The approach supports scalable, privacy-conscious large-scale multi-site analyses (e.g., ENIGMA) by reducing retraining overhead and facilitating incorporation of new sites.
Abstract
Independent and identically distributed (i.i.d.) data is essential to many data analysis and modeling techniques. In the medical domain, collecting data from multiple sites or institutions is a common strategy, dictated by the decentralized nature of medical data, that helps ensure sufficient clinical diversity. However, data from different sites are easily biased by the local environment or facilities, thereby violating the i.i.d. assumption. A common remedy is to harmonize the data, removing site bias while retaining important biological information. ComBat is among the most popular harmonization approaches and has recently been extended to handle distributed sites. However, when new sites join during training, or when data from unknown or unseen sites must be evaluated, ComBat cannot accommodate them and requires retraining with data from all sites. This retraining incurs significant computational and logistical overhead that is usually prohibitive. In this work, we develop a novel Cluster ComBat harmonization algorithm, which leverages cluster patterns of the data in different sites and greatly advances the usability of ComBat harmonization. We use extensive simulation and real medical imaging data from ADNI to demonstrate the superiority of the proposed approach. Our code is available at https://github.com/illidanlab/distributed-cluster-harmonization.
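To make the idea of sharing harmonization parameters across clusters concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that each site is summarized by its mean feature vector, that sites are grouped with a small k-means routine, and that each cluster is corrected by a plain location-scale adjustment; ComBat's empirical Bayes shrinkage and covariate preservation are omitted. All function names (`site_signatures`, `kmeans`, `fit_cluster_harmonizer`, `harmonize_site`) are hypothetical.

```python
import numpy as np

def site_signatures(X, sites):
    """Summarize each site by its mean feature vector (an illustrative choice)."""
    labels = np.unique(sites)
    return labels, np.stack([X[sites == s].mean(axis=0) for s in labels])

def _assign(points, centroids):
    """Index of the nearest centroid for every point."""
    diffs = points[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def kmeans(points, k, n_iter=50):
    """Small k-means with deterministic farthest-point initialization."""
    chosen = [points[0]]
    for _ in range(k - 1):
        diffs = points[:, None, :] - np.array(chosen)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        chosen.append(points[nearest.argmax()])
    centroids = np.array(chosen, dtype=float)
    for _ in range(n_iter):
        assign = _assign(points, centroids)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = points[assign == c].mean(axis=0)
    return centroids, _assign(points, centroids)

def fit_cluster_harmonizer(X, sites, k):
    """Estimate one (mean, std) pair per cluster of sites, not per site."""
    X, sites = np.asarray(X, dtype=float), np.asarray(sites)
    labels, signatures = site_signatures(X, sites)
    centroids, assign = kmeans(signatures, k)
    site_to_cluster = dict(zip(labels.tolist(), assign.tolist()))
    sample_cluster = np.array([site_to_cluster[s] for s in sites.tolist()])
    params = {}
    for c in np.unique(sample_cluster):
        Xc = X[sample_cluster == c]
        params[int(c)] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-8)
    return {
        "centroids": centroids,
        "params": params,
        "site_to_cluster": site_to_cluster,
        "grand_mean": X.mean(axis=0),
        "grand_std": X.std(axis=0),
    }

def harmonize_site(X_new, model):
    """Map a (possibly unseen) site to its nearest cluster and reuse that
    cluster's parameters, so no refit over all sites is needed."""
    X_new = np.asarray(X_new, dtype=float)
    ids = sorted(model["params"])
    dists = np.linalg.norm(model["centroids"][ids] - X_new.mean(axis=0), axis=1)
    cluster = ids[int(dists.argmin())]
    mu, sd = model["params"][cluster]
    harmonized = (X_new - mu) / sd * model["grand_std"] + model["grand_mean"]
    return harmonized, cluster
```

In this sketch, a site that was absent during fitting is handled by calling `harmonize_site` on its data alone: the stored cluster centroids and parameters are reused as-is, which is the property that avoids retraining on all sites.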
