Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Shuting Zhao, Chenkang Du, Kristin Qi, Xinrong Chen, Xinhan Di

TL;DR

The paper tackles the challenge of adapting depth foundation models to endoscopic scenes, where previous approaches often search only a low-rank subspace, limiting training dynamics. It introduces a two-stage, full-parameter adaptation that treats convolution, MLP, and attention sub-spaces as separate domains and then fuses them into a unified space: Stage 1 applies low-rank updates $W_i^{stage1} = W_i + B_i A_i$, and Stage 2 uses a gradient-projection bridge to obtain $W_i^{stage2} = \alpha W_i^{stage1} + \beta B_i^{stage1}$ with learnable $\alpha$ and $\beta$. The approach yields significant improvements on the SCARED dataset, with reported reductions in Abs Rel, Sq Rel, RMSE, and RMSE log, and ablations validate the contribution of each sub-space. This framework aims to deliver more accurate and memory-efficient endoscopic depth estimation, paving the way for multi-subspace and multi-model deployments in clinical settings.
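The Stage 1 update quoted above, $W_i^{stage1} = W_i + B_i A_i$, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer names, shapes, and rank are hypothetical, and the zero initialization of $B_i$ is an assumption borrowed from common low-rank adapter practice rather than something the summary states. Stage 2 is deliberately left out, since the summary does not give enough detail to reproduce it.

```python
import numpy as np


def low_rank_update(W, B, A):
    """Return W + B @ A, the Stage 1 form quoted in the TL;DR."""
    return W + B @ A


def init_adapter(d_out, d_in, rank, rng):
    # Assumption (not from the source): B starts at zero so the adapted
    # weight initially equals the pretrained weight; A is small random noise.
    B = np.zeros((d_out, rank))
    A = rng.normal(scale=0.01, size=(rank, d_in))
    return B, A


rng = np.random.default_rng(0)

# Hypothetical weights, one per sub-space named in the summary.
weights = {
    "attention": rng.normal(size=(8, 8)),
    "mlp": rng.normal(size=(16, 8)),
    "conv": rng.normal(size=(4, 36)),  # a (4, 4, 3, 3) kernel flattened to 2-D
}

# Each sub-space gets its own pair (B_i, A_i), adapted independently.
adapters = {
    name: init_adapter(*W.shape, rank=2, rng=rng) for name, W in weights.items()
}
stage1 = {
    name: low_rank_update(W, *adapters[name]) for name, W in weights.items()
}

for name, W in stage1.items():
    print(name, W.shape)
```

Because each $B_i A_i$ has rank at most the chosen adapter rank, the update can only move $W_i$ within a low-rank subspace, which is the limitation the paper sets out to address.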

Abstract

Adaptation methods have recently been developed to adapt depth foundation models to endoscopic depth estimation. However, such approaches typically underperform full-parameter training, since they limit the parameter search to a low-rank subspace and alter the training dynamics. We therefore propose a full-parameter and parameter-efficient learning framework for endoscopic depth estimation. In the first stage, the attention, convolution and multi-layer perceptron subspaces are adapted simultaneously, each within its own subspace. In the second stage, a memory-efficient optimization is proposed for subspace composition, and performance is further improved in the united subspace. Initial experiments on the SCARED dataset show that the first stage alone improves performance by 10.2% to 4.1% on Sq Rel, Abs Rel, RMSE and RMSE log compared with state-of-the-art models.
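For readers unfamiliar with the four error metrics named in the abstract, the sketch below computes them using the definitions commonly used in monocular depth estimation. It assumes the paper follows these standard definitions, which the summary does not confirm; the function name `depth_metrics` and the toy arrays are illustrative only.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Standard depth errors over pixels with valid ground truth (gt > 0).

    Assumes predictions are strictly positive on valid pixels, since
    RMSE log takes the logarithm of both depth maps.
    """
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    diff = pred - gt
    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
    }


# Toy example: the pixel with gt == 0 is treated as invalid and ignored.
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
pred = np.array([[1.1, 1.8], [4.4, 3.0]])
m = depth_metrics(pred, gt)
for key, value in m.items():
    print(f"{key}: {value:.4f}")
```

All four metrics are errors, so lower is better, and an improvement over a baseline corresponds to a reduction in the reported value.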

Paper Structure (7 sections, 4 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Two-stage adaptation of the depth foundation model EndoDAC (Cui et al., 2024).
  • Figure 2: The first and fourth columns show the GT RGB images. The second and fifth columns show depth visualizations from the state-of-the-art model EndoDAC (Cui et al., 2024). The third and sixth columns show depth visualizations from the proposed first-stage module.