Progressive trajectory matching for medical dataset distillation

Zhen Yu, Yang Liu, Qingchao Chen

TL;DR

We propose a novel dataset distillation method that condenses the original medical image datasets into a synthetic one, preserving the information needed to build an analysis model without access to the original datasets.

Abstract

Sharing medical image datasets is essential but challenging because of privacy issues, which hinders building foundation models and knowledge transfer. In this paper, we propose a novel dataset distillation method that condenses the original medical image datasets into a synthetic one, preserving the information needed to build an analysis model without access to the original datasets. Existing methods tackle only natural images, randomly matching parts of the training trajectories of model parameters trained on the whole real datasets. However, our extensive experiments on medical image datasets show that this training process is extremely unstable and yields inferior distillation results. To overcome these barriers, we propose a novel progressive trajectory matching strategy that improves training stability for medical image dataset distillation. We further observe that the improved stability limits synthetic dataset diversity and final performance gains. Therefore, we propose a dynamic overlap mitigation module that improves synthetic dataset diversity by dynamically eliminating the overlap across different images and retraining parts of the synthetic images for better convergence. Finally, we propose a new medical image dataset distillation benchmark covering various modalities and configurations to promote fair evaluation. Our proposed method achieves an 8.33% improvement over previous state-of-the-art methods on average, and an 11.7% improvement when ipc=2 (i.e., 2 images per class). Code and benchmarks will be released.
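
The abstract does not state the matching objective, so the following is only a minimal sketch of the normalized parameter-distance loss commonly used in prior trajectory-matching work: the squared distance between the student parameters (trained on synthetic images) and a later expert checkpoint, divided by the squared distance the expert itself travelled. The function `sample_start_step` is a hypothetical illustration of a "progressive" schedule in which the range of sampled expert start steps widens over distillation iterations; its linear form and all parameter names are assumptions and not the authors' actual strategy.

```python
import random


def matching_loss(student_end, expert_start, expert_target):
    """Normalized squared distance between flattened parameter vectors.

    student_end:   student parameters after training on synthetic data
    expert_start:  expert checkpoint the student was initialized from
    expert_target: later expert checkpoint the student should reach
    """
    num = sum((s - t) ** 2 for s, t in zip(student_end, expert_target))
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    return num / den


def sample_start_step(iteration, total_iterations, max_start, rng=random):
    """Hypothetical progressive schedule (an assumption, not from the paper).

    The upper bound on the sampled expert start step grows linearly
    from 0 to max_start as distillation proceeds.
    """
    frac = min(1.0, iteration / max(1, total_iterations))
    upper = int(round(frac * max_start))
    return rng.randint(0, upper)
```

Under this sketch, a loss of 0 means the student landed exactly on the expert target, while a loss of 1 means it made no net progress from the expert start checkpoint.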

Paper Structure (21 sections, 5 equations, 7 figures, 5 tables)

This paper contains 21 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Overview of medical dataset distillation. (b) Performance of various methods under different probabilities of sampling the backend trajectories.
  • Figure 2: Overall architecture. Using multiple buffer trajectories, the synthetic images and labels are used to minimize the cross-entropy loss and obtain the synthetic student trajectory parameters (orange). The differences between the synthetic and buffer trajectories are back-propagated as gradients to update the synthetic images. We propose progressive trajectory matching, which iteratively returns to the original starting points for stable matching, and dynamic overlap mitigation and retraining techniques to improve synthetic image diversity.
  • Figure 3: Comparison of the time and memory complexity of our method, which uses a dynamic expert time-step range, with other trajectory matching methods on PATHMNIST.
  • Figure 4: Comparison of stability on PATHMNIST during the distillation and evaluation stages.
  • Figure 5: (a) MMD plots using different overlap mitigation losses on PATHMNIST. (b), (c) Visualization of the performance improvements in the ablation study.
  • ...and 2 more figures