PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation
Xinqi Xiong, Andrea Dunn Beltran, Jun Myeong Choi, Marc Niethammer, Roni Sengupta
TL;DR
PPS-Ctrl tackles the sim-to-real gap in colonoscopy depth estimation by combining Stable Diffusion with ControlNet, conditioned on a Per-Pixel Shading (PPS) map so that translated images keep the input structure while gaining realistic textures. The method separates domains via text prompts and uses a PPS-based encoder–decoder to produce geometry-aware conditioning, which preserves depth better and yields more realistic images than GAN-based baselines. Quantitatively, it improves depth metrics (e.g., RMSE, AbsRel, and $\delta < 1.1$) and image translation quality (lower FID) on cross-domain translations such as SimCol3D→C3VD and C3VD→Colon10K, with PPS plus a ControlDecoder delivering the best results. The approach supports depth-guided navigation in clinical endoscopy and offers a practical way to use synthetic data for real-world deployment.
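For readers unfamiliar with the depth metrics named above, the following is a minimal sketch of how RMSE, AbsRel, and the threshold accuracy $\delta < 1.1$ are commonly computed. It assumes the standard definitions from the monocular depth literature; the function name `depth_metrics` and its arguments are hypothetical and not taken from the PPS-Ctrl code, and any masking or scale alignment the authors apply is not reproduced here.

```python
import math


def depth_metrics(pred, gt, threshold=1.1):
    """Common depth-estimation metrics (assumed standard definitions).

    pred, gt: equal-length sequences of strictly positive depth values.
    Returns RMSE, mean absolute relative error, and the fraction of
    pixels whose ratio max(pred/gt, gt/pred) is below `threshold`.
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    if any(p <= 0 or g <= 0 for p, g in zip(pred, gt)):
        raise ValueError("depth values must be strictly positive")

    n = len(gt)
    # Root mean squared error, in the same units as the depth values.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # Absolute error normalized by the ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # Threshold accuracy: share of pixels within a multiplicative band.
    delta = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold) / n
    return {"rmse": rmse, "abs_rel": abs_rel, "delta": delta}


# Toy example: two exact predictions and one that is twice too deep.
metrics = depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])
print(metrics)
```

Lower RMSE and AbsRel are better, while a higher $\delta$ accuracy is better; the tight 1.1 threshold makes the metric sensitive to small relative depth errors.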
Abstract
Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show that our approach produces more realistic translations and improves depth estimation over the GAN-based MI-CycleGAN baseline. Our code is publicly accessible at https://github.com/anaxqx/PPS-Ctrl.
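To illustrate why a shading map can constrain structure more strongly than a depth map, here is a minimal sketch of a per-point shading value. The abstract does not state the PPS formula, so this is an assumption: a Lambertian surface lit by a point light co-located with the camera, with inverse-square falloff. The function name `pps_value` is hypothetical and is not taken from the PPS-Ctrl repository.

```python
import math


def pps_value(point, normal):
    """Shading at one surface point (assumed model, not the paper's exact one).

    point:  (x, y, z) surface position in camera coordinates, light at origin.
    normal: surface normal at that point (any non-zero length).
    Returns max(0, n . l) / r^2, i.e. Lambertian cosine term with
    inverse-square attenuation over the distance r to the light.
    """
    r = math.sqrt(sum(c * c for c in point))
    n_len = math.sqrt(sum(c * c for c in normal))
    if r == 0.0 or n_len == 0.0:
        raise ValueError("point and normal must be non-zero vectors")

    # Unit vector from the surface point toward the light at the origin.
    light_dir = [-c / r for c in point]
    unit_normal = [c / n_len for c in normal]
    cos_theta = sum(l * n for l, n in zip(light_dir, unit_normal))
    return max(0.0, cos_theta) / (r * r)


# Two surface patches at the same depth but with different orientation.
facing = pps_value((0.0, 0.0, 2.0), (0.0, 0.0, -1.0))   # faces the camera
tilted = pps_value((0.0, 0.0, 2.0), (0.0, -1.0, -1.0))  # tilted by 45 degrees
print(facing, tilted)
```

Under this assumed model, both patches have identical depth yet different shading, because the value depends on surface orientation as well as distance. That extra dependence on orientation is one plausible reading of the claim that PPS is a stronger structural constraint than depth alone.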
