Vision-Based Neurosurgical Guidance: Unsupervised Localization and Camera-Pose Prediction

Gary Sarwin; Alessandro Carretta; Victor Staartjes; Matteo Zoli; Diego Mazzatenta; Luca Regli; Carlo Serra; Ender Konukoglu

TL;DR

This work addresses endoscopic neurosurgical localization by developing an unsupervised, anatomy-based approach that constructs a surgical path from video data. It embeds sequences of endoscopic frames into a 3D latent space $\boldsymbol{z}=[z^1,z^2,z^3]$, where $z^1$ encodes progress along the path and $z^2,z^3$ describe pitch and yaw, enabling camera-pose estimation without dedicated camera parameters. The method combines YOLOv7-based bounding-box detections with a transformer-encoder autoencoder to learn this latent space and to rotate centered-view detections into observed views via a rotation $\boldsymbol{R}_t$, trained with a composite loss on classification, bounding boxes, and pose. Evaluations on 166 real transsphenoidal adenomectomy videos and a Blender synthetic dataset show competitive detection performance ($\mathrm{mAP}_{0.5}=0.534$) and precise angle predictions (pitch $0.43^\circ$, yaw $0.69^\circ$) with high depth correlation ($r=0.97$), demonstrating the feasibility of unsupervised, image-based neuronavigation that provides guidance on endoscope orientation. Limitations include focusing on a single procedure and the need for future work on integrating SLAM and intraoperative MRI guidance, as well as obtaining some labeling for absolute path positions.
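
As an illustration of how figures like the angle errors and the depth correlation above could be computed, the following is a minimal sketch. It assumes the angle figures are mean absolute errors in degrees and that $r$ is Pearson's correlation coefficient; the summary states neither explicitly. The function names and sample values are hypothetical and are not taken from the paper.

```python
import math


def mean_absolute_error(pred, target):
    """Mean absolute difference between two equal-length sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not results from the paper).
true_pitch_deg = [0.0, 5.0, -5.0, 10.0]
pred_pitch_deg = [0.5, 4.5, -5.5, 10.5]
pitch_error = mean_absolute_error(pred_pitch_deg, true_pitch_deg)

latent_z1 = [0.1, 0.2, 0.3, 0.4]      # unitless position along the path
true_depth = [10.0, 20.0, 30.0, 40.0]  # ground-truth depth, arbitrary units
depth_corr = pearson_r(latent_z1, true_depth)
```

A correlation is a natural choice for the depth comparison under these assumptions: Pearson's $r$ is unchanged by shifting or positively rescaling either variable, so a latent coordinate learned without supervision does not need to share units with the ground-truth depth.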

Abstract

Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as difficulties due to the endoscopic device, such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, together with the weights of the trained YOLOv7 model, is available at: https://surgicalvision.bmic.ethz.ch.

Paper Structure (10 sections, 1 equation, 3 figures)

Introduction
Methods
Problem Formulation and Approach
Object Detection
Embedding and Camera-Pose
Experiments and Results
Datasets
Implementation Details
Results
Conclusion

Figures (3)

Figure 1: The model comprises an encoder and a decoder. The encoder takes $\mathbf{C}_t$ as input and embeds this sequence into a 3D latent representation. The decoder consists of two fully connected networks that generate the class probabilities $\hat{\mathbf{y}}_t$ and the bounding box coordinates $\hat{\mathbf{b}}^I_t$ from $z_t^1$. The encoder also outputs $z_t^2$ and $z_t^3$, which are used to construct a rotation matrix that rotates the predicted bounding boxes around the pitch and yaw axes.
Figure 2: An overview of the model used for the creation of the synthetic dataset.
Figure 3: Predicted viewing directions for sequences in the synthetic dataset (row 1) and the medical dataset (rows 2 and 3). For the medical dataset, the predicted location along the surgical path is also shown. The depicted cameras are for illustrative purposes only. The bottom row shows images that the autoencoder (AE) maps to the same location along the surgical path: the same anatomical location appears under different points of view and during different stages of the surgery.
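
The Figure 1 caption says that $z_t^2$ and $z_t^3$ are used to build a rotation matrix that rotates the predicted bounding boxes around the pitch and yaw axes. The sketch below shows one way such a rotation could be applied. Several choices here are assumptions not stated in the source: the composition order $R = R_{\text{yaw}} R_{\text{pitch}}$, a pinhole model with a hypothetical focal length `focal` in normalized image units, box coordinates in $[0, 1]$ with the image center at $(0.5, 0.5)$, and re-fitting an axis-aligned box around the rotated corners.

```python
import math


def rotation_matrix(pitch, yaw):
    """Return R = R_yaw @ R_pitch as a 3x3 nested list (angles in radians).

    Assumption: pitch rotates about the x-axis, yaw about the y-axis.
    """
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy, sy * sp, sy * cp],
        [0.0, cp, -sp],
        [-sy, cy * sp, cy * cp],
    ]


def rotate_point(point, pitch, yaw, focal=1.0):
    """Rotate one normalized image point by lifting it to a camera ray."""
    x, y = point
    ray = (x - 0.5, y - 0.5, focal)  # back-project relative to image center
    rot = rotation_matrix(pitch, yaw)
    rx, ry, rz = (sum(rot[i][j] * ray[j] for j in range(3)) for i in range(3))
    if rz <= 0:
        raise ValueError("rotated point falls behind the camera")
    # Re-project onto the image plane at distance `focal`.
    return (focal * rx / rz + 0.5, focal * ry / rz + 0.5)


def rotate_box(box, pitch, yaw, focal=1.0):
    """Rotate a box (x_min, y_min, x_max, y_max) and re-fit an aligned box."""
    x_min, y_min, x_max, y_max = box
    corners = [(x_min, y_min), (x_min, y_max), (x_max, y_min), (x_max, y_max)]
    rotated = [rotate_point(c, pitch, yaw, focal) for c in corners]
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical centered-view box, rotated by a small yaw.
centered_box = (0.4, 0.4, 0.6, 0.6)
observed_box = rotate_box(centered_box, pitch=0.0, yaw=math.radians(5.0))
```

With zero pitch and yaw the box is returned unchanged, which matches the idea of a centered view as the reference from which observed views are generated. In this sketch a positive yaw shifts the box toward larger x; the sign convention is part of the assumption and may differ from the paper's.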
