Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras

Beilei Cui, Long Bai, Mobarakol Islam, An Wang, Zhiqi Ma, Yiming Huang, Feng Li, Zhen Chen, Zhongliang Jiang, Nassir Navab, Hongliang Ren

TL;DR

This work tackles the challenge of accurate 3D reconstruction from endoscopic video under limited ground-truth data. It introduces Endo3DAC, a unified, self-supervised framework that freezes a foundation-model backbone and uses GDV-LoRA adapters with task-specific decoders to jointly estimate depth $D_t$, relative pose $T_{t\rightarrow s}$, and intrinsics $K$ from monocular videos. A dense 3D reconstruction pipeline then refines depth scales/shifts and optimizes poses to enable TSDF fusion, achieving state-of-the-art results across four public endoscopic datasets with relatively few trainable parameters. The approach demonstrates strong generalization (zero-shot) to unseen organs and cameras, suggesting imminent practical impact for real-time surgical navigation and VR/AR-assisted procedures. Overall, Endo3DAC showcases how selective adaptation of foundation models can yield fast, robust endoscopic 3D scene understanding without requiring ground-truth data during training.
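The three predicted quantities ($D_t$, $T_{t\rightarrow s}$, $K$) are exactly what is needed to warp a pixel from the target frame into a source frame, which is the usual basis of self-supervised depth training. The following is a minimal sketch of that reprojection, assuming a pinhole camera with intrinsics `(fx, fy, cx, cy)` and a pose given as a rotation matrix `R` plus translation `t`. The function names are illustrative and do not come from the Endo3DAC code, and the loss built on top of this warp is not shown.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), pinhole model (assumption)


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Point3:
    """Lift pixel (u, v) with depth D_t(u, v) to a 3D point in the target camera frame."""
    fx, fy, cx, cy = k
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(p: Point3, R: List[List[float]], t: List[float]) -> Point3:
    """Apply the relative pose T_{t->s} = [R | t] to a 3D point."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p: Point3, k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D point in the source camera frame back to pixel coordinates."""
    fx, fy, cx, cy = k
    x, y, z = p
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def reproject(u: float, v: float, depth: float, k: Intrinsics,
              R: List[List[float]], t: List[float]) -> Tuple[float, float]:
    """Pixel location in the source frame corresponding to target pixel (u, v)."""
    return project(transform(backproject(u, v, depth, k), R, t), k)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    k = (100.0, 100.0, 50.0, 40.0)
    # A small sideways camera motion shifts the pixel horizontally.
    print(reproject(60.0, 40.0, 2.0, k, identity, [0.1, 0.0, 0.0]))
```

Sampling the source image at the reprojected locations yields a synthesized target view, and the difference between that view and the real target frame can supervise depth, pose, and intrinsics without ground-truth labels.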

Abstract

Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which remain underexplored. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes the depth maps' scales and shifts, along with a few other parameters, based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that requires only surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.
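To make the idea of per-frame scale and shift alignment concrete, here is a small sketch that fits a scale $s$ and shift $b$ so that $s \cdot d + b$ matches reference depths in the least-squares sense. This closed-form fit is a stand-in chosen for illustration. The optimization actually used by the paper's reconstruction pipeline is not described in this excerpt, and `fit_scale_shift` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], ref: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (scale, shift) such that scale * pred + shift approximates ref."""
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("need two equally sized samples with at least 2 values")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predicted depths are constant; scale is undetermined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    shift = mean_r - scale * mean_p
    return scale, shift


if __name__ == "__main__":
    # Depths of the same surface points seen from two frames (toy values).
    frame_a = [1.0, 2.0, 3.0, 4.0]
    frame_b = [2.0 * d + 0.5 for d in frame_a]
    print(fit_scale_shift(frame_a, frame_b))
```

Only two numbers are estimated per depth map here, which is why aligning many frames can remain cheap compared with refining every depth pixel.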

Paper Structure

This paper contains 46 sections, 15 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Comparison between previous self-supervised depth estimation methods and our proposed Endo3DAC. Previous methods (left) utilize two separate networks to estimate the depth map and the relative pose, which also requires intrinsic parameters for training. In contrast, our proposed method (right) estimates the depth map, relative pose, and camera intrinsic parameters with one integrated network.
  • Figure 2: Illustration of the proposed Endo3DAC SSL depth estimation framework. A ViT-based encoder and a DPT-like decoder pre-trained in Depth Anything (yang2024depth) are employed. We propose Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) to fine-tune one model for different tasks with different sets of parameters. Convolutional neck blocks (Conv Neck) are added to enhance the network. Only a small fraction of the parameters are trainable (orange), and separate decoder heads predict depth maps, relative poses, and intrinsics within one network.
  • Figure 3: Illustration of GDV-LoRA Tuning Block. Different sets of parameters are used for the depth estimation task and pose-intrinsic estimation task with a control gate. We use the gradient color and arrows to represent the dynamic variation between training and frozen states.
  • Figure 4: The proposed dense scene reconstruction framework. Given a monocular surgical video, we first use Endo3DAC to generate the depth maps, relative poses, and camera intrinsic parameters. Then, we propose a patch-sampling geometric consistency alignment module to optimize a small number of variables to align the scale and shift among all depth maps. Poses are initialized with estimated poses and optimized concurrently. Finally, we obtain a dense scene reconstruction with the optimized depth maps and relative poses.
  • Figure 5: Qualitative depth comparison on the SCARED, SimCol, Hamlyn, and C3VD datasets. Our method generates smoother, more continuous, and more plausible depth maps with clearer edges, most notably in the zero-shot setting on Hamlyn and C3VD, which indicates strong generalization.
  • ...and 4 more figures
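
The captions for Figures 2 and 3 describe a frozen backbone whose layers are adapted by task-specific low-rank parameters selected through a control gate. The sketch below illustrates that general pattern for a single linear layer: a frozen weight `W`, plus a low-rank update `B · diag(u) · A` with an output scaling vector `v`, where the gate picks the parameter set for the current task. The exact GDV-LoRA formulation is not given in this excerpt, so the layout of `A`, `B`, `u`, `v` and the class name `GatedLowRankLinear` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
# (A, B, u, v): A is r x in, B is out x r, u scales the rank dimension,
# v scales the output dimension. This layout is an assumption, not the paper's.
Adapter = Tuple[Matrix, Matrix, Vector, Vector]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


class GatedLowRankLinear:
    """Frozen linear layer with one low-rank adapter per task, chosen by a gate."""

    def __init__(self, frozen_weight: Matrix, adapters: Dict[str, Adapter]) -> None:
        self.frozen_weight = frozen_weight  # never updated during fine-tuning
        self.adapters = adapters            # only these would be trained

    def forward(self, x: Vector, task: str) -> Vector:
        A, B, u, v = self.adapters[task]    # the gate: select parameters by task
        base = matvec(self.frozen_weight, x)
        low = [ui * li for ui, li in zip(u, matvec(A, x))]      # project down, scale
        delta = [vi * di for vi, di in zip(v, matvec(B, low))]  # project up, scale
        return [b + d for b, d in zip(base, delta)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    layer = GatedLowRankLinear(W, {
        "depth": ([[1.0, 1.0]], [[1.0], [0.0]], [2.0], [1.0, 1.0]),
        # Zero-initialised B leaves the frozen layer's output unchanged.
        "pose_intrinsics": ([[1.0, 1.0]], [[0.0], [0.0]], [1.0], [1.0, 1.0]),
    })
    print(layer.forward([1.0, 2.0], "depth"))
    print(layer.forward([1.0, 2.0], "pose_intrinsics"))
```

Because the frozen weight is shared and each task touches only its own small adapter, one backbone can serve both the depth branch and the pose-intrinsics branch while the number of trainable parameters stays low.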