Identifying Surgical Instruments in Pedagogical Cataract Surgery Videos through an Optimized Aggregation Network

Sanya Sinha, Michal Balazia, Francois Bremond

TL;DR

This work tackles real-time surgical instrument detection in pedagogical cataract videos by introducing Go-ELAN YOLOV9, a detector that combines Programmable Gradient Information (PGI) with an optimized GELAN backbone to alleviate information bottlenecks during training. On a curated 615-image dataset spanning 10 instrument classes, the model achieves AP $=0.829$ and mAP $=0.723$ at IoU $=0.5$ (and mAP $=0.525$ at IoU $=0.95$), outperforming multiple YOLO variants, DETR, and Laptool. Key contributions include the integration of the PGI mechanism, the Go-ELAN backbone optimization, and a publicly available annotated dataset of cataract-surgical frames. The results indicate practical potential for real-time instrument tracking, which could enable applications such as live captioning and enhanced surgical education in ophthalmology training.

Abstract

Instructional cataract surgery videos are crucial for ophthalmologists and trainees to observe surgical details repeatedly. This paper presents a deep learning model for real-time identification of surgical instruments in these videos, using a custom dataset scraped from open-access sources. Inspired by the architecture of YOLOV9, the model employs a Programmable Gradient Information (PGI) mechanism and a novel Generally-Optimized Efficient Layer Aggregation Network (Go-ELAN) to address the information bottleneck problem, enhancing mean Average Precision (mAP) at higher Non-Maximum Suppression Intersection over Union (NMS IoU) scores. The Go-ELAN YOLOV9 model, evaluated against YOLO v5, v7, v8, vanilla v9, Laptool, and DETR, achieves a superior mAP of 73.74 at IoU 0.5 on a dataset of 615 images with 10 instrument classes, demonstrating the effectiveness of the proposed model.
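
The IoU thresholds quoted above (0.5 and 0.95) refer to the overlap between two bounding boxes. As a minimal sketch of that quantity only, assuming axis-aligned boxes in `(x1, y1, x2, y2)` corner format, the following may help; the function names `box_iou` and `is_true_positive` are hypothetical and are not taken from the paper's code.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, iou_threshold=0.5):
    """Illustrative matching rule: a prediction counts if IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= iou_threshold


if __name__ == "__main__":
    pred = (0.0, 0.0, 10.0, 10.0)
    truth = (0.0, 0.0, 10.0, 8.0)
    print(box_iou(pred, truth))
    print(is_true_positive(pred, truth, 0.5), is_true_positive(pred, truth, 0.95))
```

A stricter threshold such as 0.95 demands near-perfect localization, which is consistent with the lower mAP reported at that setting.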

Paper Structure (10 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Go-ELAN YOLOV9 Complete Architecture. The Auxiliary block implements the Programmable Gradient Information (PGI) concept: it creates an auxiliary reverse branch that enables reliable gradient calculation by avoiding potential semantic loss. The GELAN block in the backbone feature extractor is replaced by the Go-ELAN block proposed in this paper. The Spatial Pyramid Pooling block SPPELAN removes the fixed-size limitation of the backbone. The ADown block downsamples the generated feature maps to target sizes. The CBLinear blocks extract higher-level features from the images, and the CBFuse block fuses these extracted features. The Neck combines the acquired features, and the Head predicts the final bounding box outputs with their respective probabilities.
  • Figure 2: Go-ELAN Architecture. The size of the downsampling filters increases from 128 in GELAN to 512 in Go-ELAN to accommodate greater spatial context. A label smoother is added to the loss computation to spread out the probability mass.
  • Figure 3: Qualitative Examination of Model Performance. Rows 1 and 3 show the ground-truth labels, while rows 2 and 4 show the corresponding predictions.
  • Figure 4: Qualitative and Quantitative Evaluation of the Model.
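
The Figure 2 caption mentions a label smoother that spreads out the probability mass. As a rough illustration of the general technique, here is a minimal sketch of standard uniform label smoothing. The helper name `smooth_labels` and the value `epsilon=0.1` are assumptions chosen for the example; this summary does not state the setting the authors used.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Uniform label smoothing: move a fraction epsilon of the probability
    mass from the hard target and share it equally across all classes.

    epsilon=0.1 is an assumed, illustrative value.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


if __name__ == "__main__":
    # Ten classes, matching the number of instrument classes in the dataset.
    hard_target = [0.0] * 10
    hard_target[3] = 1.0  # arbitrary class index, purely illustrative
    soft_target = smooth_labels(hard_target)
    print(soft_target)
    print(sum(soft_target))
```

The smoothed target still sums to one, but the true class no longer carries all of the mass, which discourages over-confident class predictions.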