Improving the Classification Effect of Clinical Images of Diseases for Multi-Source Privacy Protection
Tian Bowen, Xu Zhengyang, Yin Zhihao, Wang Jingying, Yue Yutao
TL;DR
Privacy constraints hinder data sharing across hospitals for medical image analysis. The authors propose a data-vector framework in which each site fine-tunes a common pre-trained network on its private data and computes a per-site data vector $\tau_j = \theta_j - \theta_{pre}$. The data vectors are combined linearly into a single vector $\tau_{sum}$, which gives synthetic weights $\theta_{mix} = \theta_{pre} + \tau_{sum}$ without any exchange of private data. The fusion model is obtained by applying the synthesized vector to the pre-trained weights, followed by a small restoration step that aligns batch-norm statistics. Experiments on PAD-UFES-20, Retina, and Endoscopic Bladder Tissue show that the data-vector method significantly outperforms single-site fine-tuning and rivals full-data training, while random vectors underperform, which indicates that the gain comes from the data vectors themselves and not from perturbing the weights. The work offers a practical privacy-preserving way to leverage dispersed medical data, gives theoretical intuition for why parameter mixing improves generalization, and points out directions for future exploration.
Abstract
Privacy protection requirements in the medical field restrict data sharing, which makes it hard to integrate data across hospitals to train high-precision auxiliary diagnostic models. Traditional centralized training is difficult to apply because pooling patient data violates privacy protection principles. Federated learning, a distributed machine learning framework, helps address this issue, but it requires multiple hospitals to participate in training simultaneously, which is hard to achieve in practice. To address these challenges, we propose a data-vector-based framework for training on private medical data. In this framework, each hospital fine-tunes a pre-trained model on its own private data and calculates a data vector, which represents the optimization direction of the model parameters in the solution space. The data vectors are then summed to generate synthetic weights that integrate model information from multiple hospitals. This approach enhances model performance without exchanging private data or requiring synchronous training. Experimental results demonstrate that the method effectively utilizes dispersed private data resources while protecting patient privacy. The auxiliary diagnostic model trained with this approach significantly outperforms models trained independently by a single hospital. It offers a new perspective on resolving the conflict between medical data privacy protection and model training, and it advances the development of medical intelligence.
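The data-vector step described above can be illustrated with a minimal sketch. It is not the authors' implementation: model weights are represented as plain dictionaries of NumPy arrays, the function names `data_vector` and `mix_weights` and the `scale` parameter are hypothetical, and the restoration step for batch-norm statistics is omitted. It only shows that each hospital needs to share a parameter difference, not its images.

```python
import numpy as np

def data_vector(theta_pre, theta_ft):
    """Per-site data vector: fine-tuned weights minus pre-trained weights."""
    return {name: theta_ft[name] - theta_pre[name] for name in theta_pre}

def mix_weights(theta_pre, data_vectors, scale=1.0):
    """Synthetic weights: pre-trained weights plus the summed data vectors.

    `scale` is a hypothetical coefficient added for illustration;
    scale=1.0 corresponds to the plain sum.
    """
    theta_mix = {}
    for name, w in theta_pre.items():
        tau_sum = sum(tau[name] for tau in data_vectors)
        theta_mix[name] = w + scale * tau_sum
    return theta_mix

# Toy "pre-trained" weights shared by all sites (made-up values).
theta_pre = {
    "fc.weight": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "fc.bias": np.array([0.0, 0.0]),
}

# Toy stand-ins for weights after local fine-tuning at two hospitals.
theta_a = {"fc.weight": theta_pre["fc.weight"] + 0.5,
           "fc.bias": np.array([0.1, -0.1])}
theta_b = {"fc.weight": theta_pre["fc.weight"] - 0.25,
           "fc.bias": np.array([0.2, 0.3])}

# Each site computes its data vector locally; only these are shared.
taus = [data_vector(theta_pre, t) for t in (theta_a, theta_b)]
theta_mix = mix_weights(theta_pre, taus)
```

In a real setting the dictionaries would be replaced by the framework's own parameter containers, and the mixed model would still need the restoration step mentioned in the text before evaluation.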
