Image Distillation for Safe Data Sharing in Histopathology
Zhe Li, Bernhard Kainz
TL;DR
This paper tackles safe data sharing in histopathology by replacing real, privacy-sensitive data with a small, human-readable synthetic dataset produced by a class-conditional latent diffusion model. An Infomap-based graph approach selects the 100 most informative images per class from a large synthetic pool, and a joint loss with a contrastive component improves downstream learning from the distilled data. On PathMNIST, the distilled synthetic set achieves accuracy and AUC competitive with baseline methods and with small real-data baselines, which supports its practical viability for privacy-preserving collaborative learning. The results indicate that dataset distillation, paired with graph-based sample selection and contrastive training, can enable efficient, privacy-safe data sharing for histopathology applications, with potential broader impact on foundation model development.
Abstract
Histopathology can help clinicians make accurate diagnoses, determine disease prognosis, and plan appropriate treatment strategies. As deep learning techniques prove successful in the medical domain, the primary challenges become limited data availability and concerns about data sharing and privacy. Federated learning addresses this challenge by training models locally and updating parameters on a server. However, issues such as domain shift and bias persist and degrade overall performance. Dataset distillation presents an alternative approach to overcoming these challenges. It involves creating a small synthetic dataset that encapsulates essential information and can be shared without constraints. At present, this paradigm is not practicable, as current distillation approaches generate only non-human-readable representations and exhibit insufficient performance on downstream learning tasks. We train a latent diffusion model and construct a new distilled synthetic dataset with a small number of human-readable synthetic images. Maximally informative synthetic images are selected via graph community analysis of the representation space. We compare downstream classification models trained on our synthetic distillation data to models trained on real data and reach performance suitable for practical application.
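To make the selection step concrete, the following is a minimal sketch of graph-based sample selection over feature vectors, not the authors' implementation. It assumes cosine similarity with a fixed threshold to build the graph, and it substitutes plain connected components for Infomap community detection so that the example needs only the standard library. The function names, the threshold, and the rule of picking the most strongly connected node per community are illustrative assumptions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(features, threshold):
    """Connect two images when their feature similarity reaches the threshold."""
    adj = defaultdict(dict)
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            sim = cosine(features[i], features[j])
            if sim >= threshold:
                adj[i][j] = sim
                adj[j][i] = sim
    return adj


def communities(n, adj):
    """Stand-in for Infomap: group nodes by connected component (assumption)."""
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in adj.get(node, {}):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


def select_representatives(features, threshold=0.9, budget=100):
    """Pick one image index per community, largest communities first.

    Within a community, the node with the highest summed edge weight is
    chosen; ties go to the lower index. `budget` caps the number of picks.
    """
    adj = build_graph(features, threshold)
    groups = communities(len(features), adj)
    groups.sort(key=len, reverse=True)
    picks = []
    for group in groups[:budget]:
        best = max(group, key=lambda i: (sum(adj.get(i, {}).values()), -i))
        picks.append(best)
    return picks


if __name__ == "__main__":
    # Toy 2-D "features" for the synthetic images of one class.
    toy = [[1, 0], [0.99, 0.1], [0.98, 0.05], [0, 1], [0.1, 0.99]]
    print(select_representatives(toy, threshold=0.9, budget=2))
```

In a setting like the one described here, the feature vectors would come from an encoder applied to the synthetic pool of each class, and the budget would be 100 images per class. Swapping the `communities` function for a real Infomap run changes how nodes are grouped but leaves the rest of the pipeline unchanged.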
