Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation
Shuting Zhao, Chenkang Du, Kristin Qi, Xinrong Chen, Xinhan Di
TL;DR
The paper tackles the challenge of adapting depth foundation models to endoscopic scenes, where previous approaches often search only a low-rank subspace, which restricts the parameter search and alters the training dynamics. It introduces a two-stage, full-parameter adaptation that treats the convolution, MLP, and attention subspaces as separate domains and then fuses them into a unified space: Stage 1 applies low-rank updates $W_i^{stage1} = W_i + B_i A_i$, and Stage 2 uses a gradient-projection bridge to obtain $W_i^{stage2} = \alpha W_i^{stage1} + \beta B_i^{stage1}$ with learnable $\alpha$ and $\beta$. The approach improves results on the SCARED dataset, with reported reductions in Abs Rel, Sq Rel, RMSE, and RMSE log, and ablations validate the contribution of each subspace. The framework aims to deliver more accurate and memory-efficient endoscopic depth estimation, paving the way for multi-subspace and multi-model deployments in clinical settings.
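The Stage 1 update $W_i^{stage1} = W_i + B_i A_i$ can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the matrix sizes, the rank `r`, the zero initialization of `B` (a common low-rank adapter convention), and the helper names `stage1_weight` and `trainable_fraction` are all hypothetical. Stage 2 is not sketched, because the summary does not specify how the gradient-projection bridge is computed.

```python
import numpy as np


def stage1_weight(W, B, A):
    """Return the Stage 1 weight W + B @ A for one layer."""
    return W + B @ A


def trainable_fraction(d_out, d_in, r):
    """Fraction of parameters trained by a rank-r update versus the full matrix."""
    return r * (d_out + d_in) / (d_out * d_in)


rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2  # hypothetical layer shape and rank

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight (stand-in)
A = rng.standard_normal((r, d_in))      # low-rank factor A_i
B = np.zeros((d_out, r))                # assumed zero init, so training starts at W

W_init = stage1_weight(W, B, A)         # identical to W before any training

# Stand-in for a trained B_i: the update B @ A can never exceed rank r.
B_trained = rng.standard_normal((d_out, r))
W_stage1 = stage1_weight(W, B_trained, A)
update_rank = np.linalg.matrix_rank(B_trained @ A)
```

The rank bound on `B_trained @ A` is the restriction the summary refers to: however long Stage 1 trains, each layer's update stays inside a subspace of rank at most `r`.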
Abstract
Adaptation methods have recently been developed to adapt depth foundation models to endoscopic depth estimation. However, such approaches typically underperform full-rank training, since they limit the parameter search to a low-rank subspace and alter the training dynamics. We therefore propose a full-parameter and parameter-efficient learning framework for endoscopic depth estimation. In the first stage, the attention, convolution and multi-layer perceptron weights are adapted simultaneously, each within its own subspace. In the second stage, a memory-efficient optimization is proposed for subspace composition, and performance is further improved in the unified subspace. Initial experiments on the SCARED dataset demonstrate that the first stage yields improvements ranging from 10.2% to 4.1% on Sq Rel, Abs Rel, RMSE and RMSE log in comparison with state-of-the-art models.
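The four error metrics named above have standard definitions in the depth-estimation literature, sketched below for reference. The function name `depth_metrics` and the masking of non-positive ground-truth pixels are assumptions for illustration. The sketch also assumes strictly positive predictions and omits evaluation-protocol details such as depth capping or median scaling, which the abstract does not describe.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Compute Abs Rel, Sq Rel, RMSE and RMSE log over valid pixels.

    Pixels with non-positive ground truth are ignored (an assumed convention).
    Predictions are assumed to be strictly positive so the logarithm is defined.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    diff = pred - gt

    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
    }


# Toy example: every prediction is exactly twice the true depth.
example = depth_metrics(pred=[2.0, 2.0], gt=[1.0, 1.0])
```

All four are error measures, so lower is better, and an improvement corresponds to a reduction in the metric.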
