Transferring Relative Monocular Depth to Surgical Vision with Temporal Consistency

Charlie Budd; Tom Vercauteren

TL;DR

This work shows that temporal consistency self-supervision significantly improves on supervised training alone when transferring to the low-data regime of endoscopy, and that it outperforms the prevalent self-supervision technique for this task.

Abstract

Relative monocular depth, the task of inferring depth up to shift and scale from a single image, is an active research topic. Recent deep learning models, trained on large and varied meta-datasets, now provide excellent performance in the domain of natural images. However, few datasets exist that provide ground truth depth for endoscopic images, making training such models from scratch infeasible. This work investigates the transfer of these models into the surgical domain, and presents a simple and effective way to improve on standard supervision through the use of temporal consistency self-supervision. We show that temporal consistency significantly improves on supervised training alone when transferring to the low-data regime of endoscopy, and that it outperforms the prevalent self-supervision technique for this task. In addition, we show that our method drastically outperforms the state-of-the-art method from within the domain of endoscopy. We also release our code, model and ensembled meta-dataset, Meta-MED, establishing a strong benchmark for future work.

Paper Structure (11 sections, 5 equations, 3 figures, 2 tables)

Introduction
Materials and Methods
Datasets
Fine-tuning Losses
Standard Supervision
Augmentation Consistency Self-supervision
Temporal Consistency Self-supervision
Evaluation
Experiments
Discussion and Conclusion
Acknowledgements

Figures (3)

Figure 1: Example images from the natural image and endoscopic domains with corresponding inverse relative depth-maps. The depth-maps are generated using a MiDaS model pre-trained on natural images. The ability of the model to transfer to the endoscopic domain is notable and speaks to the fundamental nature of the depth estimation task. However, closer inspection reveals significant flaws in the estimated depth-map.
Figure 2: Schematic of our temporal consistency loss. Starting from two temporally close input images at times $t$ and $t+\delta t$ (panel one), the optical flows and depth-maps are inferred (panel two). The flow is then used to calculate a correspondence mask, and to warp the inferred depth-map at time $t+\delta t$ to align it with the image at time $t$ (panel three). The pixel-wise error between the warped and original depth-maps is masked (panel four) and averaged to give the final loss.
Figure 3: Examples from our testing dataset. The leftmost column shows the RGB image (top) and the sparse ground truth depth (bottom). The subsequent columns show, for a selection of models, the predicted depth (top), fitted to the ground truth as in \ref{eq-fit}, and the error (bottom), all on a unified colour scale.
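
The pipeline described in the Figure 2 caption can be illustrated with a short NumPy sketch. This is not the authors' released implementation; it is a minimal example under several assumptions that the captions do not state. The function name `temporal_consistency_loss`, the nearest-neighbour warping, the forward-backward flow check used for the correspondence mask, the tolerance `tol`, and the absolute-error penalty are all illustrative choices.

```python
import numpy as np


def temporal_consistency_loss(depth_t, depth_next, flow_fwd, flow_bwd, tol=1.0):
    """Masked mean absolute error between depth at t and warped depth at t + dt.

    depth_t, depth_next: (H, W) arrays of predicted depth-maps.
    flow_fwd: (H, W, 2) flow from frame t to frame t + dt, as (dx, dy) in pixels.
    flow_bwd: (H, W, 2) flow from frame t + dt back to frame t.
    tol: forward-backward tolerance in pixels (hypothetical default).
    """
    h, w = depth_t.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Where each pixel of frame t lands in frame t + dt (nearest neighbour).
    xt = np.rint(xs + flow_fwd[..., 0]).astype(int)
    yt = np.rint(ys + flow_fwd[..., 1]).astype(int)
    in_bounds = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    xc = np.clip(xt, 0, w - 1)
    yc = np.clip(yt, 0, h - 1)

    # Warp the depth-map at t + dt so that it aligns with the image at t.
    warped = depth_next[yc, xc]

    # Assumed correspondence mask: forward and backward flows should cancel.
    round_trip = flow_fwd + flow_bwd[yc, xc]
    mask = in_bounds & (np.linalg.norm(round_trip, axis=-1) <= tol)

    # Average the pixel-wise error over valid correspondences only.
    if not mask.any():
        return 0.0
    return float(np.abs(warped - depth_t)[mask].mean())
```

In a training setting the same steps would be written in a differentiable framework with bilinear sampling, so that gradients reach the depth network; the nearest-neighbour lookup here only keeps the example short.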
