EndoOmni: Zero-Shot Cross-Dataset Depth Estimation in Endoscopy by Robust Self-Learning from Noisy Labels

Qingyao Tian, Zhen Chen, Huai Liao, Xinyan Huang, Lujie Li, Sebastien Ourselin, Hongbin Liu

TL;DR

EndoOmni addresses the lack of cross-dataset generalization in endoscopic depth estimation by training a foundation model with a robust teacher-student framework that learns from both labeled and unlabeled data. A per-pixel label confidence mechanism and a weighted scale-and-shift invariant loss mitigate label noise, enabling zero-shot depth estimation across diverse endoscopy datasets. The model achieves state-of-the-art zero-shot relative depth estimation on Hamlyn and SERV-CT and provides a strong initialization for fine-tuning metric depth estimation, with transfer to polyp segmentation and bronchoscopy localization. These results demonstrate improved generalization, robustness to noisy medical labels, and practical applicability in real-world endoscopy tasks.

Abstract

Single-image depth estimation is essential for endoscopy tasks such as localization, reconstruction, and augmented reality. Most existing methods for surgical scenes focus on in-domain depth estimation, which limits their real-world applicability. This constraint stems from the scarcity and inferior labeling quality of medical training data. In this work, we present EndoOmni, the first foundation model for zero-shot cross-domain depth estimation in endoscopy. To harness the potential of diverse training data, we refine the advanced self-learning paradigm, in which a teacher model generates pseudo-labels that guide a student model trained on large-scale labeled and unlabeled data. To address the training disturbance caused by inherent noise in depth labels, we propose a robust training framework that uses both the depth labels and the confidence estimated by the teacher model to jointly guide student training. Moreover, we propose a weighted scale-and-shift invariant loss that adaptively adjusts learning weights based on label confidence, biasing learning towards cleaner label pixels while reducing the influence of highly noisy pixels. Experiments on zero-shot relative depth estimation show that EndoOmni improves on state-of-the-art methods in medical imaging by 33% and on existing foundation models by 34% in terms of absolute relative error on specific datasets. Furthermore, our model provides a strong initialization for fine-tuning metric depth estimation, maintaining superior performance in both in-domain and out-of-domain scenarios. The source code is publicly available at https://github.com/TianCuteQY/EndoOmni.
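
The abstract names a weighted scale-and-shift invariant loss but does not give its formula here. As a rough illustration only, the sketch below assumes a weighted least-squares scale/shift alignment followed by a confidence-weighted mean absolute residual. The function names (`align_scale_shift`, `weighted_ssi_loss`) and the exact form of the weighting are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def align_scale_shift(pred, target, weight):
    """Weighted least-squares scale s and shift t so that s * pred + t ~ target.

    Hypothetical helper: solves the 2x2 weighted normal equations in closed form.
    """
    w, p, g = weight.ravel(), pred.ravel(), target.ravel()
    a00 = np.sum(w * p * p)
    a01 = np.sum(w * p)
    a11 = np.sum(w)
    b0 = np.sum(w * p * g)
    b1 = np.sum(w * g)
    det = a00 * a11 - a01 * a01
    if det <= 1e-12:
        # Degenerate case (e.g. constant prediction): fall back to identity.
        return 1.0, 0.0
    s = (a11 * b0 - a01 * b1) / det
    t = (a00 * b1 - a01 * b0) / det
    return s, t


def weighted_ssi_loss(pred, target, confidence):
    """Confidence-weighted mean absolute error after scale/shift alignment.

    `confidence` is a per-pixel weight in [0, 1]; low values reduce the
    influence of pixels whose labels are believed to be noisy (assumed form).
    """
    s, t = align_scale_shift(pred, target, confidence)
    residual = np.abs(s * pred + t - target)
    return float(np.sum(confidence * residual) / max(np.sum(confidence), 1e-12))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((4, 4))
    label = 2.0 * pred + 3.0          # label differs only by scale and shift
    label[0, 0] += 10.0               # one corrupted label pixel
    uniform = np.ones_like(pred)
    masked = uniform.copy()
    masked[0, 0] = 0.0                # zero confidence on the corrupted pixel
    print("uniform weights:", weighted_ssi_loss(pred, label, uniform))
    print("confidence-weighted:", weighted_ssi_loss(pred, label, masked))
```

In this toy case the corrupted pixel dominates the loss under uniform weights, while giving it zero confidence lets the remaining pixels be fit almost exactly, which is the behaviour the abstract attributes to confidence-based weighting.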

Paper Structure (15 sections, 13 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: EndoOmni demonstrates exceptional zero-shot performance across various unseen endoscopy datasets: (a) the OBR dataset [ye2016online], (b) the da Vinci surgical dataset [ye2017self], (c) the Heico dataset [maier2021heidelberg], and (d) our own bronchoscopy data collected from porcine models. The scenes encompass a range of environments, including tubular structures, complex surgical tools, and intricate lumen hierarchy.
  • Figure 2: Training framework of EndoOmni: A teacher network is first trained on labeled datasets. Subsequently, a student model is trained on a mix of labeled and unlabeled data, using our robust learning loss guided by the pretrained teacher model. The diverse data and robust training framework together enhance the student model's generalization ability.
  • Figure 3: Model learning dynamics. Noisy annotations are created by misaligning image frames with ground truth depth labels. Annotation error is the L1 distance between ground truth and noisy annotations. Clean and noisy pixels are sampled based on annotation error, with loss convergence shown as mean $\pm$ standard deviation. The loss on clean pixels quickly converges, while it remains higher on noisy pixels throughout training.
  • Figure 4: Correlation between pseudo label inconsistency and noise. Pseudo labels are the average predictions from the teacher network across augmentations. Label noise is the SSI difference between ground truth and pseudo labels, while label inconsistency measures the SSI discrepancy between teacher outputs. The binned analysis shows that noise increases significantly in high inconsistency regions (80th percentile threshold = 1.0), indicating that higher inconsistency generally suggests larger pseudo label errors.
  • Figure 5: Zero-shot performance of EndoOmni on SERV-CT (top two rows) and the Hamlyn Dataset (bottom two rows), compared with EndoDAC [cui2024endodac], the leading SOTA method for endoscopy, and Depth Anything [yang2024depth], the top-performing SOTA foundation model. Quantitative results are also provided for the model without our robust training loss, denoted as SSI. We show corresponding point clouds rendered from the same viewpoint. Misalignments with ground truth are highlighted in boxes on the bottom row for easy identification.
  • ...and 3 more figures
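
The Figure 4 caption describes pseudo labels as the average of teacher predictions across augmentations, and label inconsistency as the discrepancy between those teacher outputs. The sketch below illustrates that idea under several assumptions that the caption does not confirm: each prediction is normalized by its median and mean absolute deviation (one common scale-and-shift normalization), inconsistency is the mean absolute deviation from the averaged pseudo label, and confidence decays above a percentile threshold. All function names and the confidence mapping are hypothetical.

```python
import numpy as np


def ssi_normalize(depth):
    """Remove shift (median) and scale (mean absolute deviation) from a depth map."""
    shift = np.median(depth)
    scale = np.mean(np.abs(depth - shift))
    return (depth - shift) / max(scale, 1e-8)


def pseudo_label_and_inconsistency(teacher_preds):
    """Average normalized teacher predictions and measure their per-pixel spread.

    `teacher_preds` is a list of (H, W) arrays, assumed to be already warped
    back to the original image frame after augmentation.
    """
    normed = np.stack([ssi_normalize(p) for p in teacher_preds])  # (K, H, W)
    pseudo = normed.mean(axis=0)
    inconsistency = np.abs(normed - pseudo).mean(axis=0)
    return pseudo, inconsistency


def confidence_from_inconsistency(inconsistency, percentile=80.0):
    """Assumed mapping: confidence 1 up to the threshold, then threshold / inconsistency."""
    thresh = max(float(np.percentile(inconsistency, percentile)), 1e-6)
    return thresh / np.maximum(inconsistency, thresh)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.random((8, 8))
    # Three teacher outputs that agree up to scale and shift, except one pixel.
    preds = [base.copy(), 3.0 * base + 1.0, 0.5 * base + 2.0]
    preds[0][0, 0] += 5.0
    pseudo, inc = pseudo_label_and_inconsistency(preds)
    conf = confidence_from_inconsistency(inc)
    print("most inconsistent pixel:", np.unravel_index(inc.argmax(), inc.shape))
    print("confidence at that pixel:", conf[0, 0])
```

A confidence map of this kind could then serve as the per-pixel weight in a weighted loss, so that regions where the teacher disagrees with itself contribute less to the student's training signal.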