PathoPainter: Augmenting Histopathology Segmentation via Tumor-aware Inpainting
Hong Liu, Haosen Yang, Evi M. C. Huijben, Mark Schuiveling, Ruisheng Su, Josien P. W. Pluim, Mitko Veta
TL;DR
PathoPainter tackles data scarcity in histopathology tumor segmentation by reframing synthetic data generation as tumor-aware inpainting conditioned on regional embeddings. It uses a latent diffusion model built on VQ-VAE latents and a self-supervised foreground embedding to produce accurate, mask-aligned tumor regions, with embedding sampling from other images to boost diversity. An adaptive uncertain-region filter removes regions likely to mislead segmentation training, improving robustness. Across DCIS, CATCH, and CAMELYON16, PathoPainter consistently improves segmentation IoU when synthetic data are added and outperforms prior methods, demonstrating practical impact for histopathology data augmentation and segmentation learning.
Abstract
Tumor segmentation plays a critical role in histopathology, but it requires costly, fine-grained image-mask pairs annotated by pathologists. Thus, synthesizing histopathology data to expand the dataset is highly desirable. Previous works suffer from inaccurate and insufficiently diverse image-mask pairs, both of which degrade segmentation training, particularly given the small scale of available datasets and the inherent complexity of histopathology images. To address this challenge, we propose PathoPainter, which reformulates image-mask pair generation as a tumor inpainting task. Specifically, our approach preserves the background while inpainting the tumor region, ensuring precise alignment between the generated image and its corresponding mask. To enhance dataset diversity while maintaining biological plausibility, we incorporate a sampling mechanism that conditions tumor inpainting on regional embeddings from a different image. Additionally, we introduce a filtering strategy to exclude uncertain synthetic regions, further improving the quality of the generated data. Our comprehensive evaluation spans multiple datasets featuring diverse tumor types and various training data scales. Segmentation improves significantly when our synthetic data are added, surpassing existing segmentation data synthesis approaches, e.g., from 75.69% to 77.69% on CAMELYON16. The code is available at https://github.com/HongLiuuuuu/PathoPainter.
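
The two ideas in the abstract that lend themselves to a small illustration are the background-preserving inpainting and the exclusion of uncertain regions. The NumPy sketch below is not the PathoPainter implementation. It assumes the generated tumor content is already available as an array (in the paper it would come from the diffusion model), and the function names, the `IGNORE_LABEL` value, and the fixed probability thresholds are hypothetical choices made for illustration. The paper describes its filter as adaptive, so the fixed thresholds here are a simplification.

```python
import numpy as np

# Hypothetical label value that a segmentation loss would be told to skip.
IGNORE_LABEL = 255


def composite_inpainting(background, generated, tumor_mask):
    """Take generated pixels inside the tumor mask and keep the original
    background everywhere else, so image and mask stay aligned."""
    inside = tumor_mask.astype(bool)[..., None]  # (H, W, 1), broadcasts over channels
    return np.where(inside, generated, background)


def filter_uncertain(tumor_mask, tumor_prob, low=0.3, high=0.7):
    """Mark pixels whose predicted tumor probability falls in [low, high]
    as ignored. The thresholds are illustrative assumptions."""
    label = tumor_mask.astype(np.uint8).copy()
    uncertain = (tumor_prob >= low) & (tumor_prob <= high)
    label[uncertain] = IGNORE_LABEL
    return label


# Toy example: a 4x4 RGB patch with a 2x2 tumor region in the middle.
rng = np.random.default_rng(0)
background = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
generated = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
tumor_mask = np.zeros((4, 4), dtype=np.uint8)
tumor_mask[1:3, 1:3] = 1

image = composite_inpainting(background, generated, tumor_mask)

# Pretend a classifier is confident everywhere except one tumor pixel.
tumor_prob = np.full((4, 4), 0.9)
tumor_prob[1, 1] = 0.5
label = filter_uncertain(tumor_mask, tumor_prob)
```

In this sketch the mask used for compositing is the same mask used as the training label, which is what keeps each synthetic image-mask pair aligned by construction. Pixels set to `IGNORE_LABEL` would then be skipped by the segmentation loss, for example through an ignore-index option where the training framework provides one.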
