EasyVis2: A Real Time Multi-view 3D Visualization System for Laparoscopic Surgery Training Enhanced by a Deep Neural Network YOLOv8-Pose

Yung-Hong Sun, Gefei Shen, Jiangang Chen, Jayer Fernandes, Amber L. Shada, Charles P. Heise, Hongrui Jiang, Yu Hen Hu

TL;DR

EasyVis2 tackles the depth perception gap in laparoscopic surgery (LS) by providing real-time multi-view 3D visualization using a five-camera array and markerless tool pose estimation. It extends the EasyVis framework with YOLOv8-Pose to detect 2D tool skeletons in each view, then uses multi-view triangulation to reconstruct 3D tool poses and render them over a live background, supported by a dedicated ST-Pose dataset. A semi-automatic data collection and augmentation strategy enables marker-free training of the 4-point grasper model, achieving high 2D pose precision (e.g., up to $\text{Precision} \approx 0.993$) and improved 3D reconstruction quality, with per-frame processing of around $12.6$ ms for five views. The results demonstrate real-time performance, substantial improvements over the baseline EasyVis in 3D reconstruction metrics, and strong potential for deployment in LS training and, prospectively, in real-world surgery.

Abstract

EasyVis2 is a system designed to provide hands-free, real-time 3D visualization for laparoscopic surgery. It incorporates a surgical trocar equipped with an array of micro-cameras, which can be inserted into the body cavity to offer an enhanced field of view and a 3D perspective of the surgical procedure. A specialized deep neural network algorithm, YOLOv8-Pose, is utilized to estimate the position and orientation of surgical instruments in each individual camera view. These multi-view estimates enable the calculation of 3D poses of surgical tools, facilitating the rendering of a 3D surface model of the instruments, overlaid on the background scene, for real-time visualization. This study presents methods for adapting YOLOv8-Pose to the EasyVis2 system, including the development of a tailored training dataset. Experimental results demonstrate that, with an identical number of cameras, the new system improves 3D reconstruction accuracy and reduces computation time. Additionally, the adapted YOLOv8-Pose system shows high accuracy in 2D pose estimation.
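The step from per-view 2D keypoints to a 3D tool pose can be illustrated with a minimal sketch of linear (DLT) multi-view triangulation. This is a generic textbook formulation, not necessarily the reconstruction method used in EasyVis2. The camera intrinsics, the 2 cm camera spacing, and the helper names `make_camera`, `project`, and `triangulate_point` are assumptions made for illustration only.

```python
import numpy as np

def triangulate_point(proj_mats, pixels):
    """Linear (DLT) triangulation of one keypoint seen in N >= 2 views.

    proj_mats: list of 3x4 camera projection matrices (assumed calibrated).
    pixels:    list of (u, v) detections of the same keypoint, one per view.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def make_camera(cx):
    """Hypothetical pinhole camera centered at (cx, 0, 0), looking along +z."""
    K = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0, 0.0, 1.0]])
    Rt = np.hstack([np.eye(3), np.array([[-cx], [0.0], [0.0]])])
    return K @ Rt

# Five cameras in a row (assumed layout), one synthetic keypoint in front of them.
cameras = [make_camera(cx) for cx in (-0.04, -0.02, 0.0, 0.02, 0.04)]
X_true = np.array([0.01, -0.02, 0.15])
pixels = [project(P, X_true) for P in cameras]
X_est = triangulate_point(cameras, pixels)
```

In a setup like the one described, `pixels` would hold the 2D keypoints detected in each view, and the function would be called once per keypoint of the 4-point grasper skeleton.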

Paper Structure

This paper contains 14 sections, 9 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Real-time 3D rendering framework for surgical tools. a) A camera array with five cameras captures the video stream of a laparoscopic surgery tool in the laparoscopic surgery box trainer. b) YOLOv8-Pose estimates the 2D skeleton of the surgical tool for each camera view and each video frame. c) The 3D skeleton of the surgical tool is estimated through 3D reconstruction. d) Augmented Reality completes the object surface model. e) The reconstructed 3D skeleton is rendered with a virtual surgical tool model from any view.
  • Figure 2: EasyVis 3D reconstruction system workflow. This work builds on this framework and focuses on improving the 2D object detection and pose estimation module.
  • Figure 3: YOLOv8-Pose structure. The neural network first extracts features from the input image in the backbone, then detects the object area in the detection head, and finally estimates the object pose keypoints in the detected area.
  • Figure 5: One set of samples in the ST-Pose dataset. The captured image simulates the view under a laparoscope, with only the functional head visible. (a) A surgical grasper whose pose is defined by a bounding box and four keypoints. (b) Object mask covering the surgical tool area. (c) Marker mask covering the marker area. (d) The label that describes the object pose states.
  • Figure 6: Demo of data augmentation using masks. a) Original image. b) Data augmentation results. The background is replaced with different random textures and the marker is replaced with rod texture. A model trained on the augmented dataset avoids treating the background as part of the object and avoids relying on the marker to estimate keypoints.
  • ...and 5 more figures
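
The mask-based augmentation described in the Figure 6 caption can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the function name `augment_with_masks`, the array layout (H x W x 3 images, H x W masks), and the way textures are supplied are all hypothetical.

```python
import numpy as np

def augment_with_masks(image, object_mask, marker_mask, background, rod_texture):
    """Replace the background and the marker region of one training image.

    image, background, rod_texture: H x W x 3 arrays of the same shape.
    object_mask, marker_mask:       H x W arrays, nonzero inside the region.
    """
    obj = object_mask.astype(bool)[..., None]
    mrk = marker_mask.astype(bool)[..., None]
    # Keep tool pixels, substitute everything else with the new background.
    out = np.where(obj, image, background)
    # Overwrite marker pixels with rod texture so the marker is not a visual cue.
    out = np.where(mrk, rod_texture, out)
    return out

# Tiny synthetic example: a 4x4 image with a 2-row "tool" and a 1-pixel "marker".
image = np.full((4, 4, 3), 10, dtype=np.uint8)
object_mask = np.zeros((4, 4), dtype=np.uint8)
object_mask[1:3, :] = 1
marker_mask = np.zeros((4, 4), dtype=np.uint8)
marker_mask[1, 1] = 1
background = np.full((4, 4, 3), 200, dtype=np.uint8)  # stand-in for a random texture
rod_texture = np.full((4, 4, 3), 50, dtype=np.uint8)  # stand-in for rod texture
augmented = augment_with_masks(image, object_mask, marker_mask, background, rod_texture)
```

Because the keypoint labels refer to image coordinates that this substitution leaves unchanged, the original labels can be reused for each augmented copy.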