
Neural Fields for 3D Tracking of Anatomy and Surgical Instruments in Monocular Laparoscopic Video Clips

Beerend G. A. Gerats, Jelmer M. Wolterink, Seb P. Mol, Ivo A. M. J. Broeders

TL;DR

The authors evaluate tracking on video clips from laparoscopic cholecystectomies, where they find mean tracking accuracies of 92.4% for anatomical structures and 87.4% for instruments, showing the feasibility of using neural fields for monocular 3D reconstruction of laparoscopic scenes.

Abstract

Laparoscopic video tracking primarily focuses on two target types: surgical instruments and anatomy. The former could be used for skill assessment, while the latter is necessary for the projection of virtual overlays. Whereas instrument and anatomy tracking have often been considered two separate problems, in this paper we propose a method for joint tracking of all structures simultaneously. Based on a single 2D monocular video clip, we train a neural field to represent a continuous spatiotemporal scene, which is used to create 3D tracks of all surfaces visible in at least one frame. Because instruments are small, they generally cover only a small part of the image, resulting in decreased tracking accuracy. Therefore, we propose enhanced class weighting to improve the instrument tracks. We evaluate tracking on video clips from laparoscopic cholecystectomies, where we find mean tracking accuracies of 92.4% for anatomical structures and 87.4% for instruments. Additionally, we assess the quality of depth maps obtained from the method's scene reconstructions. We show that these pseudo-depths have comparable quality to a state-of-the-art pre-trained depth estimator. On laparoscopic videos in the SCARED dataset, the method predicts depth with an MAE of 2.9 mm and a relative error of 9.2%. These results show the feasibility of using neural fields for monocular 3D reconstruction of laparoscopic scenes.
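
The two depth figures quoted in the abstract (an MAE in millimetres and a relative error in percent) can be illustrated with a minimal sketch. It assumes the common definitions, mean absolute error and mean of absolute error divided by ground-truth depth, computed over pixels with valid ground truth. The function name `depth_errors`, the validity rule, and the toy arrays are hypothetical, and the paper's exact evaluation protocol (for example, any scale alignment of the pseudo-depths) is not reproduced here.

```python
import numpy as np

def depth_errors(pred_mm, gt_mm):
    """Return (MAE in mm, mean relative error) over pixels with valid ground truth.

    Assumption: a pixel is valid when its ground-truth depth is finite and > 0.
    """
    pred = np.asarray(pred_mm, dtype=float)
    gt = np.asarray(gt_mm, dtype=float)
    valid = np.isfinite(gt) & (gt > 0)
    abs_err = np.abs(pred[valid] - gt[valid])
    mae = abs_err.mean()                # mean absolute error, in mm
    rel = (abs_err / gt[valid]).mean()  # mean relative error, as a fraction
    return mae, rel

# Toy 2x2 depth maps in mm; the zero entry marks a pixel without ground truth.
gt = np.array([[50.0, 100.0], [0.0, 200.0]])
pred = np.array([[55.0, 90.0], [30.0, 220.0]])
mae_mm, abs_rel = depth_errors(pred, gt)
print(f"MAE: {mae_mm:.2f} mm, relative error: {100 * abs_rel:.1f}%")
```

In this toy case every valid pixel is off by 10% of its true depth, so the relative error is 10% while the MAE depends on how deep the pixels are.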

Paper Structure

This paper contains 11 sections, 3 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of our method. A pixel $p_i$ is selected in frame $i$, from which a ray is cast through the virtual scene. Each sampled location $x_i$ along the ray is mapped to a location $u$ in the canonical volume $G$. This location is used to predict color $c$ and material density $\sigma$. To predict the translation of this point to frame $j$, the location is mapped via an inverse transform and reprojected to the predicted pixel location $\hat{p}_j$.
  • Figure 2: Examples of tracked pixels (top: gallbladder, bottom: grasper) through 80-frame video clips, with the first, middle and last frame displayed here. The full videos are available via: https://vimeo.com/920225544.
  • Figure 3: Tracks of surgical instruments. Left: first, middle and last frame of the video, with trails following the instrument. Right: instrument tracks visualized in 3D.
  • Figure 4: Performance in 2D tracking for various temporal resolutions.
  • Figure 5: An example frame from the SCARED dataset, with disparity estimated by DPT (Ranftl et al., 2021) and disparity reconstructed with OmniMotion.
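
The pipeline described in the Figure 1 caption (cast a ray, map samples to the canonical volume, query density, map to the target frame, reproject) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the ray is orthographic along the z-axis, the learned invertible mappings are replaced by hypothetical fixed translations (`to_canonical_i`, `from_canonical_j`), the density network is replaced by a hand-written slab (`slab_density`), and the composited point is normalized by the total weight.

```python
import numpy as np

def composite_weights(sigma, deltas):
    """Standard volume-rendering weights: transmittance times per-sample opacity."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    return trans * alpha

def track_pixel(p_i, depths, to_canonical_i, from_canonical_j, density_fn):
    """Predict where pixel p_i of frame i lands in frame j, plus its depth there."""
    # Assumption: orthographic ray, so samples share the pixel's (x, y).
    x_i = np.stack([np.full_like(depths, p_i[0]),
                    np.full_like(depths, p_i[1]),
                    depths], axis=1)
    u = to_canonical_i(x_i)          # frame i -> canonical volume G
    sigma = density_fn(u)            # density queried in canonical space
    deltas = np.diff(depths, append=2 * depths[-1] - depths[-2])
    w = composite_weights(sigma, deltas)
    x_j = from_canonical_j(u)        # canonical volume G -> frame j
    x_hat = (w[:, None] * x_j).sum(axis=0) / max(w.sum(), 1e-8)
    return x_hat[:2], x_hat[2]       # predicted pixel and depth in frame j

# Hypothetical stand-ins for the learned components.
shift = np.array([2.0, -1.0, 0.0])   # scene motion between frames i and j
to_canonical_i = lambda x: x         # frame i coincides with canonical space here
from_canonical_j = lambda u: u + shift

def slab_density(u):
    """Toy density: a thin opaque surface around z = 1."""
    return np.where(np.abs(u[:, 2] - 1.0) < 0.05, 50.0, 0.0)

depths = np.linspace(0.0, 2.0, 201)
p_j, depth_j = track_pixel((10.0, 20.0), depths,
                           to_canonical_i, from_canonical_j, slab_density)
print(p_j, depth_j)
```

Because the weights concentrate on the samples inside the slab, the predicted pixel follows the surface motion and the composited depth falls on the surface, which is the same mechanism that yields the pseudo-depth maps.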