PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation

Xinqi Xiong, Andrea Dunn Beltran, Jun Myeong Choi, Marc Niethammer, Roni Sengupta

TL;DR

PPS-Ctrl tackles the sim-to-real gap in colonoscopy depth estimation by fusing Stable Diffusion with ControlNet, conditioned on a Per-Pixel Shading (PPS) map to preserve structure while generating realistic textures. The method employs domain separation via text prompts and a PPS-based encoder–decoder to produce geometry-aware conditioning, achieving superior depth preservation and image realism over GAN-based baselines. Quantitatively, it yields notable improvements in depth metrics (e.g., RMSE, AbsRel, and $\delta < 1.1$) and image translation quality (lower FID) on cross-domain translations such as SimCol3D→C3VD and C3VD→Colon10K, with PPS plus a ControlDecoder delivering the best results. This approach enhances depth-guided navigation in clinical endoscopy and provides a practical pathway to leverage synthetic data for real-world deployment.
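The depth metrics named above can be illustrated with a minimal sketch. It uses the conventional definitions of RMSE, AbsRel, and threshold accuracy from the monocular depth estimation literature; the function name `depth_metrics`, the flat-list input format, and the rule that skips non-positive depths are assumptions made for illustration, not the paper's evaluation code.

```python
import math

def depth_metrics(pred, gt, threshold=1.1):
    """Compute RMSE, AbsRel, and delta < threshold for paired depth values.

    `pred` and `gt` are flat sequences of depths in the same units.
    Pixels with non-positive depth are skipped (an assumption of this sketch).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    # Root mean squared error over valid pixels.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Mean absolute error relative to the ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Fraction of pixels whose ratio max(p/g, g/p) is below the threshold.
    delta = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n
    return {"rmse": rmse, "abs_rel": abs_rel, "delta": delta}

# Toy example: three pixels, one of which is off by 25%.
metrics = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.1, 5.0])
print(metrics)
```

Lower RMSE and AbsRel are better, while a higher $\delta < 1.1$ value is better. The threshold of 1.1 is stricter than the 1.25 commonly used on general-scene benchmarks.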

Abstract

Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our method takes a depth map from any synthetic colon dataset as input and generates textures similar to real endoscopy videos while preserving the depth in the generated image. Our proposed pipeline consists of a Stable Diffusion model that can capture image statistics of both real and synthetic domains using text prompts (trained in Stage 1), and a ControlNet that guides the diffusion model to perform depth-preserving image generation through a latent encoding of a Per-Pixel Shading (PPS) map (trained in Stage 2).
  • Figure 2: We compute the error between the ground-truth per-pixel shading map and the intensity of the original image (Col 4), of the image generated with depth conditioning (Col 6), and of the image generated with PPS conditioning (Col 8). PPS conditioning significantly reduces structural inconsistency, giving an error that closely matches that of the original image.
  • Figure 3: Comparison of depth estimation on the Colon10K dataset using DepthAnything [yang2024depth], trained on C3VD$\rightarrow$Colon10K translations from our method ($\hat{D}_{\text{Ours}}$) and MI-CycleGAN [wang2024structure] ($\hat{D}_{\text{MI-cGAN}}$), as well as models trained on C3VD alone without translation ($\hat{D}_{\text{Base}}$). Our translated data significantly enhances depth estimation compared to MI-CycleGAN and no translation, particularly in the regions highlighted by the black box.
  • Figure 4: Comparison of C3VD$\rightarrow$Colon10K translation results between our method and MI-CycleGAN [wang2024structure], with similar views from Colon10K provided for reference. MI-CycleGAN often fails to preserve depth, introducing incorrect dark textures that mimic down-the-barrel views (Columns 1, 2, 3, 6) and generating unrealistic reflections (Columns 2, 4, 5, 7). Additionally, it produces noticeable checkerboard artifacts (best viewed when zoomed in), which our approach effectively mitigates.
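The comparison described in the Figure 2 caption, between a per-pixel shading map and image intensity, can be sketched roughly as follows. This is an illustrative approximation only. It assumes a Lambertian surface lit by a single point light co-located with the camera at the origin, with inverse-square falloff, and it normalizes both signals by their maximum before taking a mean absolute error. The function names `pps_value` and `shading_error`, the normalization, and the error measure are all assumptions and may differ from the paper's formulation.

```python
import math

def pps_value(point, normal):
    """Shading at a 3D surface point, assuming a point light at the camera origin.

    Assumed model: max(0, N . L) / d^2, where L is the unit direction from
    the surface point to the light and d is the point-to-light distance.
    """
    dist = math.sqrt(sum(c * c for c in point))
    norm_len = math.sqrt(sum(c * c for c in normal))
    if dist == 0 or norm_len == 0:
        raise ValueError("point and normal must be non-zero vectors")
    to_light = [-c / dist for c in point]
    cos_term = sum(l * n / norm_len for l, n in zip(to_light, normal))
    return max(0.0, cos_term) / dist ** 2

def shading_error(pps, intensity):
    """Mean absolute error between max-normalized PPS and image intensity."""
    pps_max, int_max = max(pps), max(intensity)
    if pps_max <= 0 or int_max <= 0:
        raise ValueError("inputs must contain a positive value")
    return sum(abs(p / pps_max - i / int_max)
               for p, i in zip(pps, intensity)) / len(pps)

# A surface patch 2 units in front of the camera, facing it directly.
facing = pps_value((0.0, 0.0, 2.0), (0.0, 0.0, -1.0))
# The same patch facing away from the camera receives no light.
away = pps_value((0.0, 0.0, 2.0), (0.0, 0.0, 1.0))
# Intensities proportional to the shading give zero error after normalization.
err = shading_error([0.25, 0.5, 1.0], [50, 100, 200])
```

Under this model, shading depends on both the surface orientation and the distance to the light, which is consistent with the abstract's claim that PPS provides a stronger structural constraint than a depth map alone.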