Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

Beilei Cui; Mobarakol Islam; Long Bai; Hongliang Ren

TL;DR

This work tackles monocular depth estimation in endoscopic surgery by adapting a large foundation model through adapter learning. Surgical-DINO freezes the DINOv2 encoder and trains lightweight LoRA layers alongside a depth decoder to incorporate surgical domain knowledge, enabling efficient domain adaptation. On the SCARED and Hamlyn datasets, Surgical-DINO achieves state-of-the-art depth accuracy and robust generalization, significantly outperforming zero-shot and naive fine-tuning baselines. The approach demonstrates that foundation models can be effectively repurposed for surgical depth estimation via parameter-efficient adapters, with practical implications for navigation and augmented reality in the operating room, while training only a small fraction of the parameters ($0.14$M out of $86.7$M in total).

Abstract

Purpose: Depth estimation in robotic surgery is vital for 3D reconstruction, surgical navigation, and augmented reality visualization. Although foundation models exhibit outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works have observed their limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of a foundation model for surgical depth estimation.

Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt it with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and optimize only the LoRA layers and the depth decoder to integrate features from the surgical scene.

Results: Our model is extensively validated on SCARED, a MICCAI challenge dataset collected from da Vinci Xi endoscopic surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. Ablation studies provide evidence of the remarkable effect of our LoRA layers and adaptation.

Conclusion: Surgical-DINO sheds some light on the successful adaptation of foundation models to the surgical domain for depth estimation. The results provide clear evidence that zero-shot prediction with weights pre-trained on computer vision datasets, or naive fine-tuning, is not sufficient to use the foundation model directly in the surgical domain. Code is available at https://github.com/BeileiCui/SurgicalDINO.
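The idea of freezing a pre-trained projection and training only a low-rank update can be illustrated with a minimal sketch. It follows the standard LoRA formulation, h = Wx + (alpha / r) BAx, with B initialized to zero; the class name `LoRALinear`, the helper `matvec`, the scaling `alpha / r`, and the initialization values are assumptions for illustration and are not taken from the Surgical-DINO code.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration: only A and B would receive gradient updates;
    the frozen weight stays exactly as pre-trained.
    """

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # never updated during adaptation
        # A: small random init; B: zeros, so the update starts at exactly zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.frozen_weight, x)        # frozen path: W x
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_param_count(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=1)
print(layer.forward([1.0, 2.0, 3.0]))   # matches W @ x at initialization
print(layer.trainable_param_count())    # 3 (A) + 2 (B) = 5
```

Because B starts at zero, the adapted layer reproduces the frozen layer's output before any training step, so adaptation begins from the pre-trained behavior rather than from a perturbed one.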

Paper Structure (17 sections, 6 equations, 3 figures, 5 tables)

INTRODUCTION
METHODOLOGY
Preliminaries
DINOv2
LoRA
Surgical-DINO
LoRA Layers
Network Architecture
Loss functions
EXPERIMENT
Dataset
Implementation Details
Performance metrics
Results
Ablation Studies
...and 2 more sections

Figures (3)

Figure 1: The proposed Surgical-DINO framework. The input image is transformed into tokens by extracting scaled-down patches followed by a linear projection. A positional embedding and a patch-independent class token (red) are then used to augment the embedding. We freeze the image encoder and add trainable LoRA layers to fine-tune the model. We extract tokens from different layers, then up-sample and concatenate them to form the embedding features. Another trainable decode head is used on top of the frozen model to estimate the final depth.
Figure 2: The LoRA design in Surgical-DINO. We apply LoRA only to the $q$ and $v$ projection layers in each transformer block. $W_{q}, W_{k}, W_{v}$ and $W_{o}$ denote the projection layers of $q, k, v$ and $o$, respectively.
Figure 3: Qualitative depth comparison on the SCARED dataset.
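
To make the design in Figure 2 concrete, the following sketch counts the trainable LoRA weights when only selected projections in each transformer block are adapted. The function `lora_param_count` and the sizes used below (embedding width 768, 12 blocks, rank 4) are hypothetical values chosen for illustration; the source does not state them, and the result is not meant to reproduce the paper's reported parameter count.

```python
def lora_param_count(embed_dim, rank, num_blocks, adapted_projections=("q", "v")):
    """Count LoRA weights when each adapted square projection gets
    a down matrix A (rank x embed_dim) and an up matrix B (embed_dim x rank)."""
    per_projection = rank * embed_dim + embed_dim * rank
    return num_blocks * len(adapted_projections) * per_projection


# Hypothetical sizes for illustration only (not taken from the source).
count_qv = lora_param_count(embed_dim=768, rank=4, num_blocks=12)
count_all = lora_param_count(
    embed_dim=768, rank=4, num_blocks=12, adapted_projections=("q", "k", "v", "o")
)
print(count_qv)   # 147456
print(count_all)  # 294912
```

Under these assumed sizes, restricting LoRA to the $q$ and $v$ projections halves the adapter weights relative to adapting all four projections, and the count grows linearly with the rank.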
