Robust Polyp Detection and Diagnosis through Compositional Prompt-Guided Diffusion Models

Jia Yu, Yan Zhu, Peiyao Fu, Tianyi Chen, Junbo Huang, Quanlin Li, Pinghong Zhou, Zhihua Wang, Fei Wu, Shuo Wang, Xian Yang

TL;DR

This work tackles generalization gaps in colorectal polyp detection and diagnosis by introducing a Progressive Spectrum Diffusion Model (PSDM) that unifies diverse clinical annotations (segmentation masks, bounding boxes, and colonoscopy reports) into compositional prompts. By organizing prompts into coarse and fine components and guiding diffusion with a frequency-based prompt spectrum, PSDM generates clinically realistic polyps that enhance segmentation, detection, and classification, particularly in out-of-distribution scenarios like PolypGen. The approach uses continual learning with a rehearsal buffer to mitigate forgetting and leverages LLM-derived text attributes and prototype-based prompts to enrich conditioning. Empirical results show that PSDM improves F1 and mAP and lowers FID (indicating higher image fidelity) across standard and OOD datasets, supporting its potential to bolster robustness and generalization in clinical practice.

Abstract

Colorectal cancer (CRC) is a significant global health concern, and early detection through screening plays a critical role in reducing mortality. While deep learning models have shown promise in improving polyp detection, classification, and segmentation, their generalization across diverse clinical environments, particularly with out-of-distribution (OOD) data, remains a challenge. Multi-center datasets like PolypGen have been developed to address these issues, but their collection is costly and time-consuming. Traditional data augmentation techniques provide limited variability, failing to capture the complexity of medical images. Diffusion models have emerged as a promising solution for generating synthetic polyp images, but the image generation process in current models mainly relies on segmentation masks as the condition, limiting their ability to capture the full clinical context. To overcome these limitations, we propose a Progressive Spectrum Diffusion Model (PSDM) that integrates diverse clinical annotations, such as segmentation masks, bounding boxes, and colonoscopy reports, by transforming them into compositional prompts. These prompts are organized into coarse and fine components, allowing the model to capture both broad spatial structures and fine details, generating clinically accurate synthetic images. By augmenting training data with PSDM-generated samples, our model significantly improves polyp detection, classification, and segmentation. For instance, on the PolypGen dataset, PSDM increases the F1 score by 2.12% and the mean average precision by 3.09%, demonstrating superior performance in OOD scenarios and enhanced generalization.
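
As a rough illustration of how report attributes might be split into coarse and fine prompt components, the minimal Python sketch below groups LLM-extracted fields using the categories named in the Figure 3 caption (size as coarse; type, pathology, color, and border as fine). The class `CompositionalPrompt`, the function `build_prompt`, and the example attribute values are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field

# Grouping follows the Figure 3 caption; all names below are hypothetical.
COARSE_KEYS = ("size",)
FINE_KEYS = ("type", "pathology", "color", "border")


@dataclass
class CompositionalPrompt:
    """Holds report attributes split into coarse and fine components."""
    coarse: dict = field(default_factory=dict)
    fine: dict = field(default_factory=dict)

    def to_text(self) -> str:
        # Coarse attributes first, then fine ones, joined into one string.
        merged = {**self.coarse, **self.fine}
        return "; ".join(f"{key}: {value}" for key, value in merged.items())


def build_prompt(attributes: dict) -> CompositionalPrompt:
    """Sort extracted attributes into coarse/fine groups; ignore unknown keys."""
    prompt = CompositionalPrompt()
    for key in COARSE_KEYS:
        if key in attributes:
            prompt.coarse[key] = attributes[key]
    for key in FINE_KEYS:
        if key in attributes:
            prompt.fine[key] = attributes[key]
    return prompt


# Illustrative (made-up) attributes, as an LLM might extract them from a report.
example = {"size": "5 mm", "type": "sessile", "color": "reddish", "location": "sigmoid"}
prompt = build_prompt(example)
print(prompt.to_text())
```

Here the unlisted `location` field is simply dropped; how the actual model handles attributes outside these categories, and how masks and bounding boxes enter the prompt, is not specified in this summary.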

Paper Structure

This paper contains 31 sections, 7 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Compositional prompt-guided diffusion framework for generating diverse polyp images. Left: Example prompts with varying levels of granularity. Right: Previous single-prompt methods constrain diversity. In contrast, our PSDM model employs compositional prompts to enhance augmented dataset diversity, resulting in improvements in downstream tasks.
  • Figure 2: The framework integrates diverse annotations into compositional prompts via the Prompt Spectrum to capture spatial and clinical details (a). During denoising, a conditional U-Net refines latent variables from low to high frequencies, with arrow thickness indicating each component’s contribution (b).
  • Figure 3: The Multi-Scale Prompt Construction extracts key attributes from medical reports using an LLM, categorizing them into coarse (e.g., size) and fine (e.g., type, pathology, color, border) components.
  • Figure 4: Radar chart illustrating the performance comparison between ResNet models trained on the original imbalanced dataset and the augmented balanced dataset, with exact metric values overlaid.
  • Figure 5: Comparison of confusion matrices for classification performance on the original and augmented datasets.
  • ...and 3 more figures