SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation

Aditya Bhat, Rupak Bose, Chinedu Innocent Nwoye, Nicolas Padoy

TL;DR

SimGen introduces a diffusion-based framework to jointly generate high-fidelity surgical images and their segmentation masks, addressing the scarcity and annotation burden of surgical data. By incorporating cross-correlation priors and a Canonical Fibonacci Lattice (CFL) for class separability in RGB space, the model achieves improved image quality and mask alignment across six public datasets without adversarial training. The work also defines Semantic Inception Distance (SID) to quantify region-level fidelity, and demonstrates downstream utility for training segmentation models with synthetic data under regulatory constraints. Ablation confirms CFL’s role in stabilizing convergence and preserving class separation, while analyses reveal practical use for data augmentation, domain adaptation, and educational simulations. Overall, SimGen offers a scalable, cost-effective path to paired image–label data in surgical AI, with potential extensions to bounding boxes and video segmentation.

Abstract

Acquiring and annotating surgical data is often resource-intensive and ethically constrained, and it requires significant expert involvement. While generative AI models such as text-to-image models can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and a method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and a Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between the continuous image distribution and the discrete mask distribution. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets on image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest that generated image-mask pairs are usable when regulations limit the release of human data for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.
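To make the idea of separating classes in RGB space concrete, here is a minimal sketch that places class colors on a standard Fibonacci sphere lattice and decodes a noisy RGB mask by nearest color. This is an illustration under assumptions, not the paper's CFL definition: the function names (fibonacci_lattice_colors, encode_mask, decode_mask), the scaling of the unit sphere into the 0-255 cube, and the nearest-color decoding step are all hypothetical choices made for this example.

```python
import numpy as np


def fibonacci_lattice_colors(num_classes: int) -> np.ndarray:
    """Return one RGB color per class, spread over a sphere inside the RGB cube.

    Assumption: a plain Fibonacci sphere lattice; the paper's canonical
    variant may differ in details.
    """
    golden_ratio = (1 + 5 ** 0.5) / 2
    i = np.arange(num_classes)
    # Evenly spaced heights in (-1, 1), one per class.
    z = 1 - (2 * i + 1) / num_classes
    radius = np.sqrt(1 - z * z)
    # Successive points are rotated by the golden angle around the z-axis.
    theta = 2 * np.pi * i / golden_ratio
    points = np.stack([radius * np.cos(theta), radius * np.sin(theta), z], axis=1)
    # Map the unit sphere from [-1, 1]^3 into the RGB cube [0, 255]^3.
    return np.rint((points + 1) / 2 * 255).astype(np.uint8)


def encode_mask(class_ids: np.ndarray, colors: np.ndarray) -> np.ndarray:
    """Turn an (H, W) array of class ids into an (H, W, 3) RGB mask."""
    return colors[class_ids]


def decode_mask(rgb_mask: np.ndarray, colors: np.ndarray) -> np.ndarray:
    """Assign each pixel of an (H, W, 3) mask to the nearest class color."""
    diff = rgb_mask[..., None, :].astype(np.float32) - colors.astype(np.float32)
    squared_dist = (diff ** 2).sum(axis=-1)  # shape (H, W, num_classes)
    return squared_dist.argmin(axis=-1)
```

Because the lattice points are roughly equidistant, a generated mask whose colors drift slightly from the exact class colors can still be mapped back to class ids by nearest-color lookup, which is one plausible reason a well-separated color code helps compared with random RGB or grayscale encodings.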

Paper Structure (25 sections, 17 figures, 4 tables, 1 algorithm)

This paper contains 25 sections, 17 figures, 4 tables, 1 algorithm.

Figures (17)

  • Figure 1: Sample outputs of SimGen across the 6 explored datasets. For each pair, the generated photorealistic image is on the left and the generated boundary-aligned segmentation mask is on the right. The mask is colored for visualization with the Canonical Fibonacci Lattice (CFL) function.
  • Figure 2: Illustration of the proposed Fibonacci projection of semantic class identities to 3D hyperspace in comparison with random RGB and grayscale spaces.
  • Figure 3: Architecture of the proposed SimGen model, showing the key components that work together to generate paired image-mask outputs from noise.
  • Figure 4: A random set of 6 generated image-mask pairs from the CholecSeg8K dataset
  • Figure 5: A random set of 6 generated image-mask pairs from the CaDISv2 dataset
  • ...and 12 more figures