Depth Anything in Medical Images: A Comparative Study
John J. Han, Ayberk Acar, Callahan Henry, Jie Ying Wu
TL;DR
The paper addresses depth estimation in medical endoscopic scenes, where ground-truth depth is unavailable for real patient data, by evaluating the zero-shot Depth Anything Model (DAM) against baselines trained on general scenes and in-domain baselines trained on endoscopic data. It compares the models on the EndoSLAM and Hamlyn datasets, measuring both accuracy and inference speed to assess real-time applicability. DAM shows impressive zero-shot performance but does not consistently outperform in-domain or specialized models across all sequences, though its inference speed is favorable for real-time systems. The work highlights both the potential and the limitations of foundation-model-based MDE in clinical contexts, and motivates fine-tuning and evaluation on broader datasets to improve generalization to real patient data.
Abstract
Monocular depth estimation (MDE) is a critical component of many medical tracking and mapping algorithms, particularly those operating on endoscopic or laparoscopic video. However, because ground truth depth maps cannot be acquired from real patient data, supervised learning is not a viable approach to predicting depth maps for medical scenes. Although self-supervised learning for MDE has recently gained attention, its outputs are difficult to evaluate reliably, and each model's generalizability to other patients and anatomies is limited. This work evaluates the zero-shot performance of the newly released Depth Anything Model on medical endoscopic and laparoscopic scenes. We compare the accuracy and inference speed of Depth Anything with those of other MDE models trained on general scenes, as well as in-domain models trained on endoscopic data. Our findings show that although the zero-shot capability of Depth Anything is quite impressive, it is not necessarily better than other models in either speed or accuracy. We hope that this study can spark further research in employing foundation models for MDE in medical scenes.
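The abstract does not state which accuracy metrics or alignment procedure were used. As an illustration only, the sketch below shows one common way a scale-ambiguous depth prediction can be scored against reference depth: median scaling followed by absolute relative error, RMSE, and the fraction of pixels within a ratio of 1.25. The function name `evaluate_depth`, the `min_depth` threshold, and the choice of metrics are assumptions made for this example, not details taken from the paper.

```python
import math
from statistics import median


def evaluate_depth(pred, gt, min_depth=1e-3):
    """Score a scale-ambiguous depth prediction against reference depth.

    pred, gt: flat sequences of per-pixel depth values of equal length.
    Pixels with reference depth <= min_depth (or non-positive prediction)
    are treated as invalid and skipped. All names here are illustrative.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > min_depth and p > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")

    # Median scaling: resolve the prediction's unknown global scale.
    scale = median(g for _, g in pairs) / median(p for p, _ in pairs)
    scaled = [(p * scale, g) for p, g in pairs]
    n = len(scaled)

    # Mean absolute error relative to the reference depth.
    abs_rel = sum(abs(p - g) / g for p, g in scaled) / n
    # Root mean squared error in the reference depth's units.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in scaled) / n)
    # Fraction of pixels whose prediction/reference ratio is within 1.25.
    delta1 = sum(1 for p, g in scaled if max(p / g, g / p) < 1.25) / n

    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1, "scale": scale}
```

A prediction that differs from the reference only by a constant factor, for example `evaluate_depth([1, 2, 3, 4], [2, 4, 6, 8])`, is recovered exactly after scaling. A model that outputs inverse depth would need its output converted to depth before being passed to a function like this.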
