Model-based Cleaning of the QUILT-1M Pathology Dataset for Text-Conditional Image Synthesis

Marc Aubreville, Jonathan Ganz, Jonas Ammeling, Christopher C. Kaltenecker, Christof A. Bertram

TL;DR

The paper addresses the highly heterogeneous image quality of QUILT-1M, which limits the dataset's usefulness for text-conditional image synthesis. It proposes an automatic impurity-prediction pipeline and semantic alignment filtering to curate image-caption pairs before model training. The impurity classifier reaches 92.71% accuracy, and CLIP-based filtering improves the semantic match between images and captions; diffusion training on the cleaned data yields better fidelity as measured by conditional FID. The work shows that careful cleaning of large, publicly sourced histopathology datasets can meaningfully improve generation quality and reliability for medical imaging applications.

Abstract

The QUILT-1M dataset is the first openly available dataset containing images harvested from various online sources. While it provides a huge data variety, the image quality and composition are highly heterogeneous, impacting its utility for text-conditional image synthesis. We propose an automatic pipeline that provides predictions of the most common impurities within the images, e.g., visibility of narrators, desktop environment and pathology software, or text within the image. Additionally, we propose to use semantic alignment filtering of the image-text pairs. Our findings demonstrate that rigorously filtering the dataset substantially enhances image fidelity in text-to-image tasks.
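
The two filtering steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each image-caption pair already carries embeddings from a CLIP-style model and per-impurity probabilities from a classifier, and the names `Pair`, `keep_pair`, `filter_dataset`, and both threshold values are hypothetical choices made only for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Pair:
    """One image-caption pair with precomputed model outputs (hypothetical layout)."""
    image_emb: List[float]            # image embedding from a CLIP-style encoder
    text_emb: List[float]             # caption embedding from the same model
    impurity_probs: Dict[str, float]  # e.g. {"narrator": 0.1, "text": 0.7}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def keep_pair(pair: Pair,
              impurity_threshold: float = 0.5,    # assumed value, not from the paper
              alignment_threshold: float = 0.25   # assumed value, not from the paper
              ) -> bool:
    """Keep a pair only if no impurity is predicted and image and caption align."""
    is_clean = all(p < impurity_threshold for p in pair.impurity_probs.values())
    is_aligned = cosine_similarity(pair.image_emb, pair.text_emb) >= alignment_threshold
    return is_clean and is_aligned


def filter_dataset(pairs: List[Pair], **thresholds: float) -> List[Pair]:
    """Return the subset of pairs that passes both filters."""
    return [p for p in pairs if keep_pair(p, **thresholds)]
```

In practice the thresholds would be tuned on a labeled validation subset, since stricter values trade dataset size against purity.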

Paper Structure

This paper contains 8 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Example generations by the model trained on the dataset variants and the FID metric evaluated on 10,000 image crops retrieved from two datasets.