Vision-Based Neurosurgical Guidance: Unsupervised Localization and Camera-Pose Prediction
Gary Sarwin, Alessandro Carretta, Victor Staartjes, Matteo Zoli, Diego Mazzatenta, Luca Regli, Carlo Serra, Ender Konukoglu
TL;DR
This work addresses endoscopic neurosurgical localization by developing an unsupervised, anatomy-based approach that constructs a surgical path from video data. It embeds sequences of endoscopic frames into a 3D latent space $\boldsymbol{z}=[z^1,z^2,z^3]$, where $z^1$ encodes progress along the path and $z^2,z^3$ describe pitch and yaw, enabling camera-pose estimation without dedicated camera parameters. The method combines YOLOv7-based bounding-box detections with a transformer-encoder autoencoder to learn this latent space and to rotate centered-view detections into observed views via a rotation $\boldsymbol{R}_t$, trained with a composite loss on classification, bounding boxes, and pose. Evaluations on 166 real transsphenoidal adenomectomy videos and a Blender synthetic dataset show competitive detection performance ($\mathrm{mAP}_{0.5}=0.534$) and precise angle predictions (pitch $0.43^\circ$, yaw $0.69^\circ$) with high depth correlation ($r=0.97$), demonstrating the feasibility of unsupervised, image-based neuronavigation that provides guidance on endoscope orientation. Limitations include focusing on a single procedure and the need for future work on integrating SLAM and intraoperative MRI guidance, as well as obtaining some labeling for absolute path positions.
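To make the role of the latent components concrete, the following minimal sketch shows one way a pitch and yaw pair could be turned into a rotation matrix and applied to a viewing direction. It is an illustration under stated assumptions, not the authors' implementation: the function names, the axis conventions (pitch about the x-axis, yaw about the y-axis, applied in that order), and the interpretation of $z^2, z^3$ as angles in radians are all hypothetical, since the summary above does not specify how $\boldsymbol{R}_t$ is parameterized.

```python
import math

def rotation_from_pitch_yaw(pitch, yaw):
    """Return R = R_yaw @ R_pitch as a 3x3 nested list.

    Assumed convention (not from the paper): pitch rotates about the
    x-axis, yaw about the y-axis, both angles in radians.
    """
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    # Product of the yaw and pitch matrices, multiplied out by hand.
    return [
        [cy, sy * sp, sy * cp],
        [0.0, cp, -sp],
        [-sy, cy * sp, cy * cp],
    ]

def apply_rotation(rotation, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(r * v for r, v in zip(row, vector)) for row in rotation]

def pose_from_latent(z):
    """Split a hypothetical latent z = [z1, z2, z3] into path progress
    and a rotation built from the two angular components."""
    progress, pitch, yaw = z
    return progress, rotation_from_pitch_yaw(pitch, yaw)

# Usage: a latent with no angular offset leaves the centered view unchanged.
progress, rotation = pose_from_latent([0.4, 0.0, 0.0])
centered_view_axis = [0.0, 0.0, 1.0]
observed_axis = apply_rotation(rotation, centered_view_axis)
```

With both angles at zero the rotation reduces to the identity, which corresponds to the centered view. Non-zero values tilt the viewing axis, which is the kind of deviation the model is described as recovering from the detections.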
Abstract
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as difficulties introduced by the endoscopic device, such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections and the weights of the trained YOLOv7 model are available at: https://surgicalvision.bmic.ethz.ch.
