Image Distillation for Safe Data Sharing in Histopathology

Zhe Li, Bernhard Kainz

TL;DR

This paper tackles safe data sharing in histopathology by replacing real, privacy-sensitive data with a small, human-readable synthetic dataset generated by a class-conditional latent diffusion model. An Infomap-based graph approach selects the 100 most informative images per class from a large synthetic pool, and a joint loss with a contrastive component improves downstream learning from the distilled data. On PathMNIST, classifiers trained on the distilled synthetic set reach accuracy and AUC competitive with both the compared baselines and small real-data baselines, which suggests the approach is practically viable for privacy-preserving collaborative learning. The results indicate that dataset distillation, paired with graph-based sample selection and contrastive training, can enable efficient, privacy-safe data sharing for histopathology applications, with potential broader impact on foundation model development.

Abstract

Histopathology can help clinicians make accurate diagnoses, determine disease prognosis, and plan appropriate treatment strategies. As deep learning techniques prove successful in the medical domain, the primary challenges become limited data availability and concerns about data sharing and privacy. Federated learning has addressed this challenge by training models locally and updating parameters on a server. However, issues such as domain shift and bias persist and impact overall performance. Dataset distillation presents an alternative approach to overcoming these challenges. It involves creating a small synthetic dataset that encapsulates essential information and can be shared without constraints. At present, this paradigm is not practicable, as current distillation approaches only generate non-human-readable representations and exhibit insufficient performance for downstream learning tasks. We train a latent diffusion model and construct a new distilled synthetic dataset with a small number of human-readable synthetic images. Selection of maximally informative synthetic images is done via graph community analysis of the representation space. We compare downstream classification models trained on our synthetic distillation data to models trained on real data and reach performance suitable for practical application.
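
The selection step mentioned in the abstract (graph community analysis of the representation space, then picking the most informative images) can be illustrated with a toy sketch. This is not the authors' implementation: connected components of a thresholded cosine-similarity graph stand in for Infomap communities, and within-community weighted degree stands in for the centrality measure. The function names, the `threshold` value, and the `per_community` parameter are all hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_graph(embeddings, threshold):
    """Connect two images when their cosine similarity exceeds `threshold`."""
    adj = defaultdict(dict)
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            if sim > threshold:
                adj[i][j] = sim
                adj[j][i] = sim
    return adj


def find_communities(n, adj):
    """Stand-in for Infomap (assumption): connected components of the graph."""
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        comps.append(sorted(comp))
    return comps


def select_images(embeddings, threshold=0.9, per_community=1):
    """Keep the most central images of each community.

    Centrality here is the summed edge weight to other members of the same
    community, a simple proxy chosen for illustration only.
    """
    adj = build_graph(embeddings, threshold)
    chosen = []
    for comp in find_communities(len(embeddings), adj):
        members = set(comp)
        score = {
            i: sum(w for j, w in adj[i].items() if j in members) for i in comp
        }
        ranked = sorted(comp, key=lambda i: (-score[i], i))  # ties: lower index
        chosen.extend(ranked[:per_community])
    return sorted(chosen)


# Toy 2-D "embeddings": three images near one direction, two near another.
toy = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1], [0.0, 1.0], [0.1, 0.99]]
selected = select_images(toy, threshold=0.9, per_community=1)
```

In a real pipeline the embeddings would come from a feature extractor and the communities from an actual Infomap run; only the overall shape (embed, build graph, detect communities, keep central nodes) is meant to carry over.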

Paper Structure

This paper contains 6 sections, 2 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: Overview of our InfoDist approach. (a) We train a latent diffusion model, UViT [bao2023all], and generate a synthetic dataset. (b) We extract image embeddings with a pre-trained convolutional network or UMAP [mcinnes2018umap], then use the modified Infomap algorithm to detect communities. We select a small synthetic dataset made up of the images with high modular centrality in each community. (c) We train the classifiers only on the small selected synthetic dataset and apply both the cross-entropy loss $\mathcal{L}_{ce}$ and the contrastive learning loss $\mathcal{L}_{con}$ in training.
  • Figure 2: The real and synthetic samples at different resolutions.
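
As a rough illustration of panel (c) in Figure 1, the sketch below combines a cross-entropy term with a contrastive term. The exact form of $\mathcal{L}_{con}$ and how the two losses are weighted are not given in this summary, so the code assumes a supervised-contrastive-style loss over L2-normalised features and a simple weighted sum $\mathcal{L}_{ce} + \lambda \mathcal{L}_{con}$. The `weight` ($\lambda$) and `temperature` parameters and all function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def normalize(v):
    """Scale a feature vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(features, labels, temperature=0.5):
    """Supervised-contrastive-style loss (assumed form, not from the paper).

    Each anchor is pulled toward same-label samples and pushed away from
    the rest. Anchors without any same-label partner are skipped.
    """
    z = [normalize(f) for f in features]
    total, anchors = 0.0, 0
    for i in range(len(z)):
        positives = [
            j for j in range(len(z)) if j != i and labels[j] == labels[i]
        ]
        if not positives:
            continue
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(len(z))
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def joint_loss(logits_batch, features, labels, weight=1.0, temperature=0.5):
    """Mean cross-entropy plus `weight` times the contrastive term."""
    ce = sum(
        cross_entropy(logits, y) for logits, y in zip(logits_batch, labels)
    ) / len(labels)
    return ce + weight * contrastive_loss(features, labels, temperature)


# Toy batch: two classes, features clustered by class vs. mixed across classes.
labels = [0, 0, 1, 1]
logits = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
clustered = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

loss_clustered = joint_loss(logits, clustered, labels)
loss_mixed = joint_loss(logits, mixed, labels)
```

With identical logits, the batch whose features cluster by class gets the lower joint loss, which is the behaviour the contrastive term is meant to encourage.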