EndoDAC: Efficient Adapting Foundation Model for Self-Supervised Depth Estimation from Any Endoscopic Camera

Beilei Cui; Mobarakol Islam; Long Bai; An Wang; Hongliang Ren

TL;DR

This work tackles the challenge of applying foundation-model depth estimation to endoscopy by introducing EndoDAC, a parameter-efficient, self-supervised framework. It adapts a depth foundation model (Depth Anything) to endoscopic scenes by combining Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA), Convolutional Neck blocks, and a multi-scale decoder around a frozen backbone, while a Pose-Intrinsics Net jointly estimates ego-motion and camera intrinsics from monocular videos. The approach achieves state-of-the-art self-supervised depth performance on the SCARED and Hamlyn datasets using only 1.6M trainable parameters and short training (20 epochs), with real-time inference at around 17.7 ms. Because it does not require ground-truth camera intrinsics, the method applies to a broader range of surgical datasets and reduces training costs while maintaining high accuracy and reliable 3D reconstruction capabilities.
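To make the parameter-efficiency idea concrete, the following is a minimal NumPy sketch of a frozen linear layer with a low-rank update scaled by two trainable vectors. The formulation used here (update = diag(lam_v) · B · diag(lam_u) · A), the class name `VectorScaledLoRALinear`, the rank, and the initialization are illustrative assumptions, not details taken from this summary; the dynamic switching between training and frozen states that DV-LoRA applies is not modeled.

```python
import numpy as np


class VectorScaledLoRALinear:
    """Frozen linear layer plus a low-rank update with two scaling vectors.

    Assumed forward pass (hypothetical formulation for illustration):
        y = x @ (W0 + diag(lam_v) @ B @ diag(lam_u) @ A).T
    """

    def __init__(self, w0, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = w0.shape
        self.w0 = w0                                    # frozen pretrained weight
        self.A = rng.normal(0.0, 0.02, (rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))                # trainable up-projection, zero-initialized
        self.lam_u = np.ones(rank)                      # trainable vector scaling the rank dimension
        self.lam_v = np.ones(d_out)                     # trainable vector scaling the output dimension

    def delta_weight(self):
        # diag(lam_v) @ B scales rows of B; diag(lam_u) @ A scales rows of A.
        return (self.lam_v[:, None] * self.B) @ (self.lam_u[:, None] * self.A)

    def forward(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        return x @ (self.w0 + self.delta_weight()).T

    def num_trainable(self):
        return self.A.size + self.B.size + self.lam_u.size + self.lam_v.size


# Toy usage with a 64x64 "pretrained" weight.
w0 = np.random.default_rng(1).normal(size=(64, 64))
layer = VectorScaledLoRALinear(w0, rank=4)
x = np.random.default_rng(2).normal(size=(3, 64))
y = layer.forward(x)
print(layer.num_trainable(), "trainable vs", w0.size, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the small matrices and vectors would receive gradient updates in a real training loop.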

Abstract

Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop the Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundation model to the surgical domain, using remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates camera intrinsics using the pose encoder. Our framework can be trained solely on monocular surgical videos from any camera, keeping training costs minimal. Experiments demonstrate that our approach obtains superior performance even with fewer training epochs and without knowledge of the ground-truth camera intrinsics. Code is available at https://github.com/BeileiCui/EndoDAC.
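As a rough illustration of why estimated intrinsics matter for self-supervised training, the sketch below builds a pinhole intrinsics matrix from normalized network outputs and uses it to back-project a depth map to 3D and project it back to pixels, which is the geometric core of view-synthesis losses. The normalized parameterization (focal lengths and principal point expressed as fractions of image width and height) and all function names are assumptions made for this example, not the repository's API.

```python
import numpy as np


def intrinsics_from_normalized(fx_n, fy_n, cx_n, cy_n, width, height):
    """Build a 3x3 pinhole matrix K from normalized predictions (assumed parameterization)."""
    return np.array([
        [fx_n * width, 0.0, cx_n * width],
        [0.0, fy_n * height, cy_n * height],
        [0.0, 0.0, 1.0],
    ])


def backproject(depth, K):
    """Lift every pixel of a depth map (H, W) to a 3D point; returns shape (3, H*W)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    rays = np.linalg.inv(K) @ pix                   # viewing rays at unit depth
    return rays * depth.reshape(1, -1)


def project(points, K):
    """Project 3D points (3, N) to pixel coordinates (2, N); depths must be positive."""
    uvw = K @ points
    return uvw[:2] / uvw[2:3]


# Toy usage: with an identity camera motion, projection undoes back-projection.
height, width = 6, 8
K = intrinsics_from_normalized(1.0, 1.0, 0.5, 0.5, width, height)
depth = np.full((height, width), 2.0)
points = backproject(depth, K)
pixels = project(points, K)
```

In a full pipeline, the 3D points would be transformed by the predicted ego-motion before projection, and the resulting pixel coordinates would be used to sample a neighboring frame.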

Paper Structure (12 sections, 6 equations, 6 figures, 5 tables)

Introduction
Method
Preliminaries
Foundation Models for Depth
Low-Rank Adaptation (LoRA) hu2021lora
Proposed Framework: EndoDAC
DepthNet
Pose-Intrinsics Net
Self-supervised Depth and Ego-motion Estimation
Experiments and Results
Results
Conclusion

Figures (6)

Figure 1: Illustration of the proposed Endoscopic Depth Any Camera (EndoDAC) self-supervised depth estimation framework. A ViT-based encoder and a DPT-like decoder pre-trained from Depth Anything yang2024depth are employed for DepthNet. We use a small number of trainable parameters (1.6M), comprising Dynamic Vector-Based LoRA (DV-LoRA), Convolutional Neck blocks, and Multi-Scale Decoders, to fine-tune the model. In the Pose-Intrinsics Net, ego-motion and camera intrinsic parameters are predicted with a shared encoder and separate decoders.
Figure 2: Illustration of (a) Transformer Efficient Tuning Block with DV-LoRA and (b) Convolution Neck block. In DV-LoRA, we use the gradient color and arrows to represent the dynamic variation between training and frozen states.
Figure 3: Qualitative depth comparison on the SCARED dataset.
Figure 4: Qualitative depth comparison on the SCARED dataset.
Figure 5: Qualitative 3D reconstruction comparison on the SCARED dataset.
...and 1 more figure
