Latent Diffusion Models with Image-Derived Annotations for Enhanced AI-Assisted Cancer Diagnosis in Histopathology

Pedro Osorio, Guillermo Jimenez-Perez, Javier Montalt-Tordera, Jens Hooge, Guillem Duran-Ballester, Shivam Singh, Moritz Radbruch, Ute Bach, Sabrina Schroeder, Krystyna Siudak, Julia Vienenkoetter, Bettina Lawrenz, Sadegh Mohammadi

TL;DR

This work tackles data scarcity in histopathology cancer diagnosis by training latent diffusion models (LDMs) with image-derived prompts to synthesize histology patches. It introduces a morphology-enriched prompt-building workflow that leverages DiNO-ViT embeddings and K-means clustering into 33 morphology groups to generate 66 prompts. This substantially improves synthetic-data fidelity and coverage (e.g., FID drops from $178.8$ to $90.2$) and improves downstream performance when training on synthetic data alone (AUC rises to $0.805$). Pathologist evaluation shows that synthetic patches are largely indistinguishable from real ones, underscoring the potential for data sharing and privacy-preserving augmentation. The results demonstrate that synthetic data, especially when guided by image-derived morphology cues, can meaningfully augment small real datasets and reduce data-collection costs for computer-aided diagnosis (CAD) of cancer in digital pathology.
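
The prompt-building step can be pictured with a minimal sketch. It assumes each patch already has an embedding from a frozen image encoder and that 33 K-means centroids have already been fitted. The prompt template, the function names, and the toy 2-D vectors are hypothetical illustrations, not the paper's implementation.

```python
from itertools import product

LABELS = ("healthy", "cancer")   # PCam patch labels
N_CLUSTERS = 33                  # number of morphology groups reported in the summary


def nearest_centroid(embedding, centroids):
    """Return the index of the closest centroid (the K-means assignment step)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(embedding, centroids[k]))


def build_prompt(label, cluster_id):
    """Hypothetical template; the paper's exact prompt wording is not shown here."""
    return f"histopathology patch, {label} tissue, morphology type {cluster_id}"


# One prompt per (label, cluster) pair gives 2 x 33 = 66 distinct prompts,
# which is consistent with the 66 prompts reported above.
all_prompts = {build_prompt(l, c) for l, c in product(LABELS, range(N_CLUSTERS))}

# Toy usage: made-up 2-D vectors stand in for real encoder embeddings.
toy_centroids = [(float(k), 0.0) for k in range(N_CLUSTERS)]
toy_prompt = build_prompt("cancer", nearest_centroid((4.2, 0.1), toy_centroids))
```

In this sketch the class label and the cluster index are the only two pieces of information in each prompt, so the prompt vocabulary stays small and fully automatic.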

Abstract

Artificial intelligence (AI)-based image analysis has immense potential to support diagnostic histopathology, including cancer diagnostics. However, developing supervised AI methods requires large-scale annotated datasets. A potentially powerful solution is to augment training data with synthetic data. Latent diffusion models, which can generate high-quality, diverse synthetic images, are promising. However, the most common implementations rely on detailed textual descriptions, which are not generally available in this domain. This work proposes a method that constructs structured textual prompts from automatically extracted image features. We experiment with the PCam dataset, composed of tissue patches only loosely annotated as healthy or cancerous. We show that including image-derived features in the prompt, as opposed to only healthy and cancerous labels, improves the Fréchet Inception Distance (FID) from 178.8 to 90.2. We also show that pathologists find it challenging to detect synthetic images, with a median sensitivity/specificity of 0.55/0.55. Finally, we show that synthetic data effectively trains AI models.
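
To make the reader-study numbers concrete, the sketch below shows one way a median sensitivity/specificity across readers could be computed. It assumes that "synthetic" is treated as the positive class. The reader names and calls are made up for illustration and are not the study data.

```python
from statistics import median


def sensitivity_specificity(truth, calls):
    """truth and calls are lists of booleans; True means 'synthetic' (assumed positive class)."""
    tp = sum(t and c for t, c in zip(truth, calls))
    tn = sum((not t) and (not c) for t, c in zip(truth, calls))
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    return tp / n_pos, tn / n_neg


# Made-up ground truth and reader calls (illustration only).
truth = [True, True, True, True, False, False, False, False]
readers = {
    "reader_a": [True, True, False, False, True, False, False, False],
    "reader_b": [True, False, False, False, True, True, False, False],
    "reader_c": [True, True, True, False, False, False, True, True],
}

per_reader = [sensitivity_specificity(truth, calls) for calls in readers.values()]
median_sensitivity = median(sens for sens, _ in per_reader)
median_specificity = median(spec for _, spec in per_reader)
```

For a balanced two-class task, values close to 0.5 on both measures correspond to near-chance detection, which is how the reported 0.55/0.55 can be read.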

Paper Structure (22 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Randomly selected subset of 25 samples for: (a) the real dataset; (b) the synthetic set generated by Stable Diffusion (SD) out-of-the-box without fine-tuning; (c) the synthetic set generated by an SD model fine-tuned on histopathology data using a naïve prompt-building approach; and (d) the synthetic set generated by an SD model fine-tuned on histopathology data using our proposed prompt-building approach that leverages semantic information for improved generative diversity. Image grids are categorized per label (healthy, cancer).
  • Figure 2: Overview of the pipeline proposed in this work. The inputs (a) consist of (image, label) pairs curated from Patch Camelyon (PCam). The prompt-building pipeline (b) takes these inputs and constructs a prompt (or caption) that describes each image in the input dataset. Two approaches are followed: the baseline approach, in which only the label is used to generate a textual descriptor for the image; and the morphology-enriched approach, in which a frozen image embedder (DiNO; Caron et al., 2021) is used in combination with the patch's label to automatically extract semantic features from the image (clustered into 33 morphology types) and generate a morphology-rich prompt. After prompt-building, Stable Diffusion (c), an open-source Latent Diffusion Model (LDM), is trained using either of the prompt-building approaches from (b). Stable Diffusion is based on a variational autoencoder (VAE) and a UNet. The VAE uses its encoder (E) to reduce the dimensionality of the input image into a latent ($z_0$), and can recover full-resolution images using its decoder (D). The VAE's latent ($z_0$) is used by the UNet, alongside the information in the prompt (via CLIP, a textual embedding model), to generate synthetic images. After model training, the performance of the fine-tuned Stable Diffusion model is evaluated on a series of downstream tasks (d). For this purpose, a large set of synthetic images is generated and tested using a visual Turing test, standard image quality metrics, and two classification approaches. The snowflake icon denotes a frozen model.
  • Figure 3: Randomly selected subset of 25 samples for the real (a), baseline (b) and morphology-enriched (c) datasets. The rightmost column depicts the coverage comparison of the real data distribution between the two synthetic datasets. The manifold representation is generated based on Inception-v3 latents with a 2D UMAP transform.
  • Figure 4: (a) UMAP embedding distributions: the real data distribution (blue) is better covered by the morphology-enriched prompt-building (black) than by the baseline prompts (red). Overlaid on the figure are regions R1 and R2 (in green), which are captured by both prompt-building approaches, and for which examples are selected in (b). Also overlaid are regions A through F (in red), which represent regions not captured by the baseline approach but well represented by the morphology-enriched prompt-building, with selected examples depicted in (c). Note that the baseline approach is prone to visual outliers (e.g., a green tint in region F and darker images in region A). Zoom in for detail.
  • Figure 5: Reader agreement analysis results. (a) Real and synthetic images highlighting the resampling artifacts visible on real but not on synthetic images. Synthetic images showed a smoother visual appearance. (b) Selected true positive and false positive examples with the highest inter-reader agreement. Slightly higher inter-reader agreement was found when the ground truth was synthetic, irrespective of whether the majority reader decision was a true positive or not. (c) Inter-reader reliability based on pairwise Cohen's Kappa coefficients for readers' label decisions on real (top) and synthetic images (bottom). The overall agreement for each of these scenarios is reported as the mean ($\mu$) and standard deviation ($\sigma$) of the kappa coefficients in the off-diagonal.
  • ...and 1 more figure
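
The inter-reader reliability summary described for Figure 5(c) can be sketched as follows. This is a minimal pure-Python illustration with made-up reader calls. The use of the population standard deviation and the restriction to unique reader pairs are assumptions, since the caption does not specify them.

```python
from itertools import combinations
from statistics import mean, pstdev


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up calls (1 = "synthetic", 0 = "real"); not the study data.
ratings = {
    "reader_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "reader_b": [1, 0, 0, 0, 1, 0, 1, 1],
    "reader_c": [0, 1, 0, 1, 1, 0, 0, 0],
}

# Kappa for every unique reader pair (the off-diagonal of the kappa matrix).
pairwise = {
    (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
    for r1, r2 in combinations(ratings, 2)
}
kappa_values = list(pairwise.values())
kappa_mean = mean(kappa_values)
kappa_std = pstdev(kappa_values)
```

Because the kappa matrix is symmetric, averaging over unique pairs gives the same mean as averaging over the full off-diagonal.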