
Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications

Danush Kumar Venkatesh, Dominik Rivoir, Micha Pfeiffer, Fiona Kolbinger, Marius Distler, Jürgen Weitz, Stefanie Speidel

TL;DR

This work tackles semantic distortion in unpaired image translation for surgical data generation by introducing ConStructS, a framework that combines PatchNCE-based contrastive learning with a multi-scale structural similarity (MS-SSIM) semantic loss to preserve structural content when translating from the synthetic to the real surgical domain. In experiments on cholecystectomy and gastrectomy datasets, ConStructS shows improved semantic consistency and makes the translated images more useful for downstream segmentation than multiple baselines. The findings indicate that a simple loss combination can outperform more complex architectures in preserving anatomy while maintaining realism, enabling more reliable synthetic data generation for data-scarce medical applications. The approach is relevant for training surgical perception models where labeled data are scarce or privacy-constrained, and it points toward semantically robust data synthesis in medical imaging.

Abstract

In surgical computer vision applications, obtaining labeled training data is challenging due to data-privacy concerns and the need for expert annotation. Unpaired image-to-image translation techniques have been explored to automatically generate large annotated datasets by translating synthetic images to the realistic domain. However, preserving the structure and semantic consistency between the input and translated images presents significant challenges, particularly when there is a distributional mismatch in the semantic characteristics of the domains. This study empirically investigates unpaired image translation methods for generating suitable data in surgical applications, explicitly focusing on semantic consistency. We extensively evaluate various state-of-the-art image translation models on two challenging surgical datasets and downstream semantic segmentation tasks. We find that a simple combination of structural-similarity loss and contrastive learning yields the most promising results. Quantitatively, we show that the data generated with this approach yields higher semantic consistency and can be used more effectively as training data. The code is available at https://gitlab.com/nct_tso_public/constructs.
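
To make the idea of combining a contrastive term with a structural-similarity term concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked above for that). It makes several assumptions: a single-window, single-scale SSIM stands in for MS-SSIM, the contrastive term is a generic InfoNCE loss over one query patch feature, any adversarial term is omitted, and the names `info_nce`, `ssim_global`, `combined_loss`, `tau`, `lambda_nce`, and `lambda_ssim` are hypothetical.

```python
import math


def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query feature (the general form PatchNCE builds on).

    query, positive: feature vectors of the same patch location in the
    translated and input image. negatives: features of other patches.
    tau is an assumed temperature, not a value taken from the paper.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b + 1e-12)

    # Index 0 holds the positive pair; the rest are negatives.
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    # Numerically stable log-sum-exp, then cross-entropy with target index 0.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over two equally long flat lists of intensities.

    A simplified stand-in for MS-SSIM: no sliding window, no multiple scales.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def combined_loss(nce, ssim, lambda_nce=1.0, lambda_ssim=1.0):
    """Weighted sum of the contrastive term and a (1 - SSIM) structure term.

    The weights are illustrative assumptions, not values from the paper.
    """
    return lambda_nce * nce + lambda_ssim * (1.0 - ssim)


if __name__ == "__main__":
    source_patch = [0.1, 0.5, 0.9, 0.3]      # toy input intensities
    translated_patch = [0.2, 0.5, 0.8, 0.3]  # toy translated intensities
    nce = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
    ssim = ssim_global(source_patch, translated_patch)
    print(combined_loss(nce, ssim))
```

In this sketch, the contrastive term is small when a translated patch's feature matches the feature of the same location in the input more closely than it matches other locations, and the structure term is small when the translated image keeps the input's local statistics. Both terms therefore penalize translations that move textures onto the wrong anatomy.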

Paper Structure (32 sections, 9 equations, 11 figures, 9 tables)


Figures (11)

  • Figure 1: Generation of realistic data from synthetic surgical images with an unpaired image translation method. The semantic mismatch between domains can lead to inconsistent translations, such as blood texture (red color) being mapped onto different structures (highlighted in white boxes). Some regions with consistent semantic translation are indicated in yellow boxes.
  • Figure 2: The overview of the ConStructS model with different loss functions.
  • Figure 3: Qualitative results of various translation methods on the cholecystectomy dataset. At the junction of two structures, the textures were interchanged in most of the models. Although not solved completely, the ConStructS model reduces semantic inconsistency. Some regions are highlighted in white boxes.
  • Figure 4: Qualitative samples from the gastrectomy dataset. The white boxes highlight some regions. The red box indicates one of the failure cases of ConStructS, where a tool-like texture is mapped on the liver.
  • Figure 5: Qualitative results of the ablation study on the cholecystectomy dataset. Texture mismatch occurs in low-lighting regions without the semantic loss. As seen in the second row, without the PatchNCE loss no explicit boundary exists between the liver and abdominal wall, leading to both regions having the same semantic textures.
  • ...and 6 more figures