Image Synthesis with Class-Aware Semantic Diffusion Models for Surgical Scene Segmentation

Yihang Zhou, Rebecca Towning, Zaid Awad, Stamatia Giannarou

TL;DR

An evaluation of both image quality and downstream segmentation performance demonstrates the strong effectiveness and generalisability of CASDM in producing realistic image-map pairs, significantly advancing surgical scene segmentation across diverse and challenging datasets.

Abstract

Surgical scene segmentation is essential for enhancing surgical precision, yet it is frequently compromised by the scarcity and imbalance of available data. To address these challenges, semantic image synthesis methods based on generative adversarial networks and diffusion models have been developed. However, these models often yield non-diverse images and fail to capture small, critical tissue classes, limiting their effectiveness. In response, we propose the Class-Aware Semantic Diffusion Model (CASDM), a novel approach which utilizes segmentation maps as conditions for image synthesis to tackle data scarcity and imbalance. Novel class-aware mean squared error and class-aware self-perceptual loss functions have been defined to prioritize critical, less visible classes, thereby enhancing image quality and relevance. Furthermore, to our knowledge, we are the first to generate multi-class segmentation maps using text prompts in a novel fashion to specify their contents. These maps are then used by CASDM to generate surgical scene images, enhancing datasets for training and validating segmentation models. Our evaluation, which assesses both image quality and downstream segmentation performance, demonstrates the strong effectiveness and generalisability of CASDM in producing realistic image-map pairs, significantly advancing surgical scene segmentation across diverse and challenging datasets.

Paper Structure

This paper contains 6 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The proposed pipeline, illustrating the process from segmentation map generation through semantic image synthesis to downstream segmentation tasks.
  • Figure 2: The architecture of CASDM. The inputs are the left image and the top segmentation map. The left side performs semantic image synthesis, calculating the class-aware MSE loss $\mathcal{L}_{\text{CAMSE}}$. The right side refines the process by comparing the noisy versions of the predicted image and the original input image using the same encoder as the left side, calculating the class-aware self-perceptual loss $\mathcal{L}_{\text{CASP}}$.
  • Figure 3: The architecture of the text-prompted segmentation map generator. The inputs are the left segmentation map and separate text prompts specifying class names, quantities, and locations. The output is the pixel-wise MSE loss $\mathcal{L}_{\text{MSE}}$ on the bottom.
  • Figure 4: Synthetic images generated by compared image synthesis models.
  • Figure 5: Samples of (a) segmentation maps generated by our text-prompted method and (b) segmentation maps included in the CholecSeg8K dataset.
  • ...and 1 more figure
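
The Figure 2 caption names a class-aware MSE loss $\mathcal{L}_{\text{CAMSE}}$ but this page does not reproduce its formula. The following is a minimal, framework-free sketch of the general idea of weighting per-pixel squared error by class, so that small classes are not drowned out by large ones. The inverse-frequency weighting, the flat-list inputs, and the names `class_weights` and `class_aware_mse` are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' definition.

```python
from collections import Counter


def class_weights(label_map):
    """Hypothetical inverse-frequency weights, one per class present.

    A class covering a fraction f of the pixels gets weight 1 / (C * f),
    where C is the number of classes present. With this choice the weights
    average to 1 over all pixels, so the loss scale stays comparable to
    plain MSE.
    """
    counts = Counter(label_map)
    n_pixels = len(label_map)
    n_classes = len(counts)
    return {c: n_pixels / (n_classes * k) for c, k in counts.items()}


def class_aware_mse(pred, target, label_map):
    """Class-weighted mean squared error over flattened pixels.

    pred, target: flat sequences of floats (e.g. predicted and true noise).
    label_map:    flat sequence of class ids, one per pixel.
    This is an illustrative assumption, not the paper's exact loss.
    """
    if not (len(pred) == len(target) == len(label_map)):
        raise ValueError("pred, target and label_map must have equal length")
    weights = class_weights(label_map)
    total = 0.0
    for p, t, c in zip(pred, target, label_map):
        total += weights[c] * (p - t) ** 2
    return total / len(label_map)


# Toy map: class 0 covers three pixels, class 1 covers one pixel.
labels = [0, 0, 0, 1]
target = [0.0, 0.0, 0.0, 0.0]

# The same unit error costs more on the rare class than on the common one.
loss_rare = class_aware_mse([0.0, 0.0, 0.0, 1.0], target, labels)
loss_common = class_aware_mse([1.0, 0.0, 0.0, 0.0], target, labels)
```

In this toy case an unweighted MSE would score both predictions identically, whereas the weighted version penalises the error on the single class-1 pixel more heavily. A real implementation would operate on image tensors in a deep learning framework and might derive weights from dataset-level rather than per-image statistics.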