ColonAdapter: Geometry Estimation Through Foundation Model Adaptation for Colonoscopy

Zhiyi Jiang, Yifu Wang, Xuelian Cheng, Zongyuan Ge

TL;DR

ColonAdapter addresses the challenge of estimating 3D geometry from monocular colonoscopy images by fine-tuning 3D geometric foundation models with a self-supervised framework tailored to colonoscopy. It introduces a Detail Restoration Module (DRM) to recover fine details and two novel losses, a confidence-weighted photometric loss and a geometry consistency loss, to stabilize training and enforce cross-frame coherence without ground-truth intrinsics. The approach achieves state-of-the-art results in camera pose estimation, monocular depth prediction, and dense point-map reconstruction on synthetic and real colonoscopy data. Ablation studies confirm that the DRM, the loss terms, and a simple fusion strategy collectively drive performance, though real-time scalability remains a challenge for long sequences.

Abstract

Estimating 3D geometry from monocular colonoscopy images is challenging due to non-Lambertian surfaces, moving light sources, and large textureless regions. While recent 3D geometric foundation models eliminate the need for multi-stage pipelines, their performance deteriorates in clinical scenes. These models are primarily trained on natural scene datasets and struggle with specularity and homogeneous textures typical in colonoscopy, leading to inaccurate geometry estimation. In this paper, we present ColonAdapter, a self-supervised fine-tuning framework that adapts geometric foundation models for colonoscopy geometry estimation. Our method leverages pretrained geometric priors while tailoring them to clinical data. To improve performance in low-texture regions and ensure scale consistency, we introduce a Detail Restoration Module (DRM) and a geometry consistency loss. Furthermore, a confidence-weighted photometric loss enhances training stability in clinical environments. Experiments on both synthetic and real datasets demonstrate that our approach achieves state-of-the-art performance in camera pose estimation, monocular depth prediction, and dense 3D point map reconstruction, without requiring ground-truth intrinsic parameters.
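
The abstract names a confidence-weighted photometric loss but does not give its form. As a rough illustration only, the sketch below uses the confidence-weighting pattern from DUSt3R's regression loss (error scaled by a per-pixel confidence, minus a log-confidence regularizer) and applies it to an L1 photometric error. The function name, the flat-list image representation, and the `alpha` value are all assumptions made for this example, not the paper's formulation.

```python
import math


def confidence_weighted_photometric_loss(target, reconstructed, confidence, alpha=0.2):
    """Mean of c * |I - I_hat| - alpha * log(c) over pixels.

    target, reconstructed: flat sequences of pixel intensities.
    confidence: flat sequence of strictly positive per-pixel confidences.
    alpha: hypothetical regularization weight that discourages the trivial
           solution of driving every confidence toward zero.
    """
    if not (len(target) == len(reconstructed) == len(confidence)):
        raise ValueError("inputs must have the same length")
    if len(target) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for t, r, c in zip(target, reconstructed, confidence):
        if c <= 0:
            raise ValueError("confidence values must be positive")
        # Data term is down-weighted where confidence is low; the log term
        # charges a price for lowering confidence.
        total += c * abs(t - r) - alpha * math.log(c)
    return total / len(target)
```

With every confidence equal to 1 the log term vanishes and the loss is the plain mean absolute photometric error. Lowering the confidence on a badly reconstructed pixel (for example, a specular highlight) shrinks its contribution at the cost of the log penalty.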

Paper Structure

This paper contains 16 sections, 7 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of depth map estimations on colonoscopy images between a 3D geometric foundation model (DUSt3R; Wang et al., 2024) and our proposed method. The top row shows how extensive textureless regions cause the model to misinterpret distant areas as close (and vice versa). The middle row highlights how shadows caused by moving light sources and complex anatomical structures lead DUSt3R to erroneously label them as distant. In contrast, the bottom row illustrates that our method accurately reconstructs scenes containing textureless surfaces, dynamic shadows, and non-Lambertian regions, where DUSt3R fails.
  • Figure 2: The training pipeline of our proposed framework, consisting of an adapted foundation module (a), an affiliation module (b), and loss calculation (c). The adapted foundation module takes two input images and generates corresponding point maps along with their confidence maps. The affiliation module, used only during training, provides image brightness calibration, optical flow, and affiliated pose information. With the outputs of these two modules, we reconstruct the image and compute the losses. During evaluation, we rely solely on the adapted foundation module to generate point maps, which are then used to derive camera information, including poses and intrinsic parameters.
  • Figure 3: Architecture of the proposed Detail Restoration Module. A ResNet-18 extracts multi-level features from the input images, which are then fused with ViT features through fusion adapters. The fused features are injected only into the first five layers of the ViT decoder to provide low-level information.
  • Figure 4: Qualitative depth estimation results on the EndoMapper dataset. The top row shows our method’s ability, enhanced by the integration of the DRM, to capture fine structural details (marked with a red box) that are missed by other approaches. The bottom row shows that even in the presence of unseen artifacts such as bubbles, our method still predicts artifact-free geometry.
  • Figure 5: Qualitative comparison of 3D point maps predicted by our method and by the baseline DUSt3R on real colonoscopy images. In the top row, our method successfully recovers a 3D scene from two images containing textureless and non-Lambertian surfaces, while DUSt3R produces a distorted plane. In the bottom row, our method reconstructs the scene geometry, whereas DUSt3R predicts most of the region as a plane.
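
The Figure 2 caption notes that intrinsic parameters are derived from the predicted point maps at evaluation time. A minimal sketch of how a focal length could be recovered from a single point map follows. It assumes a pinhole camera with the principal point at the image centre, square pixels, and points expressed in the camera's own frame. The function names and the closed-form least-squares fit are illustrative assumptions, not the authors' procedure (DUSt3R describes a robust iterative solver for this step).

```python
def estimate_focal(pointmap):
    """Least-squares focal length (pixels) from an H x W grid of (X, Y, Z).

    Minimizes sum over pixels of ||(u - cx, v - cy) - f * (X/Z, Y/Z)||^2,
    whose closed-form solution is f = sum(a . b) / sum(b . b).
    """
    height = len(pointmap)
    width = len(pointmap[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0  # assumed principal point

    num = 0.0  # accumulates pixel_offset . normalized_coords
    den = 0.0  # accumulates ||normalized_coords||^2
    for v, row in enumerate(pointmap):
        for u, (x, y, z) in enumerate(row):
            if z <= 0:  # skip points on or behind the camera plane
                continue
            xn, yn = x / z, y / z
            num += (u - cx) * xn + (v - cy) * yn
            den += xn * xn + yn * yn
    if den == 0.0:
        raise ValueError("point map has no usable points")
    return num / den


def synthetic_pointmap(height, width, focal, depth_fn):
    """Hypothetical helper: back-project each pixel with a known focal length."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    return [
        [
            ((u - cx) * depth_fn(u, v) / focal,
             (v - cy) * depth_fn(u, v) / focal,
             depth_fn(u, v))
            for u in range(width)
        ]
        for v in range(height)
    ]
```

Only the ratios X/Z and Y/Z enter the fit, so the estimate does not change if the whole point map is rescaled. This matters when the reconstruction is only defined up to an unknown global scale.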