Improving text-conditioned latent diffusion for cancer pathology
Aakash Madhav Rao, Debayan Gupta
TL;DR
This work addresses the limited availability of labeled cancer histopathology data by proposing PathLDM, a latent-diffusion framework that performs diffusion in a VAE latent space to synthesize high-resolution pathology images conditioned on text. PathLDM combines a VAE encoder/decoder with a time-conditioned diffusion model and uses CLIP-based text embeddings derived from pathology reports to guide generation. The authors show that optimizing the length of textual summaries improves realism, with 35-token summaries achieving a strong balance between relevance and noise and attaining an FID of 21.11 while reducing train-time memory by about 7% relative to prior baselines. Overall, the approach offers a scalable, reproducible path to realistic synthetic histopathology data, with significant implications for data augmentation, education, and automated analysis in cancer pathology.
Abstract
The development of generative models in the past decade has allowed for hyperrealistic data synthesis. While potentially beneficial, synthetic data generation has been relatively underexplored in cancer histopathology. One algorithm for synthesising a realistic image is diffusion: it iteratively converts an image to noise and learns to recover the image from this noise [Wang and Vastola, 2023]. While effective, diffusion is computationally expensive for high-resolution images, rendering it infeasible for histopathology when applied directly to the images. Variational Autoencoders (VAEs) allow us to learn a representation of complex high-resolution images in a latent space. A vital by-product of this is the ability to compress high-resolution images into that latent space and recover them with little loss. The marriage of diffusion and VAEs allows us to carry out diffusion in the latent space of an autoencoder, leveraging the realistic generative capabilities of diffusion while maintaining reasonable computational requirements. Rombach et al. [2021b] and Yellapragada et al. [2023] build foundational models for this task, paving the way to generate realistic histopathology images. In this paper, we discuss the pitfalls of current methods, namely [Yellapragada et al., 2023], and resolve critical errors while proposing improvements along the way. Our methods achieve an FID score of 21.11, beating the SOTA counterpart in [Yellapragada et al., 2023] by 1.2 FID, while reducing train-time GPU memory usage by 7%.
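To make the "image to noise" step concrete, the following is a minimal sketch of the closed-form forward (noising) process applied to a latent rather than to pixels. It is an illustration under stated assumptions, not the implementation used in this paper: the linear noise schedule, the 1000-step horizon, the 4 x 32 x 32 latent shape, and all function names are hypothetical choices made for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances (a common choice, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_s): the fraction of signal variance kept at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_diffuse(z0, t, alpha_bar, rng):
    """Sample z_t from q(z_t | z_0) in one shot; returns (z_t, noise used)."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [signal * z + sigma * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
alpha_bar = cumulative_alphas(linear_beta_schedule(1000))

# Flattened stand-in for a VAE latent (assumed shape 4 x 32 x 32);
# in a latent diffusion model this would come from the VAE encoder.
z0 = [rng.gauss(0.0, 1.0) for _ in range(4 * 32 * 32)]
zt, eps = forward_diffuse(z0, 500, alpha_bar, rng)
```

A denoising network would then be trained to predict `eps` from `zt` and the step index, and the VAE decoder would map the denoised latent back to an image. Under the assumed shapes, a 256 x 256 RGB image holds 196,608 values while the latent holds 4,096, which is where the computational saving of working in latent space comes from.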
