Adaptive Graph Learning from Spatial Information for Surgical Workflow Anticipation

Francis Xiatian Zhang, Jingjing Deng, Robert Lieck, Hubert P. H. Shum

TL;DR

This work tackles surgical workflow anticipation from live video by introducing three core innovations: a bounding-box-based spatial representation that includes detection confidence for both instruments and surgical targets, an adaptive graph learning framework that captures dynamic instrument-target interactions, and a multi-horizon objective that unifies predictions across multiple time horizons. The method improves short- to mid-term anticipation performance on two benchmarks, with MAE reductions of approximately 3% for surgical phase anticipation and 9% for remaining surgical duration, and shows enhanced robustness to visual artifacts. By integrating stable spatial representations with adaptable relational graphs and a learnable multi-horizon loss, the approach offers more reliable, real-time anticipation that can improve preparation, coordination, and safety in Robotic-Assisted Surgery. The work also provides new annotations for instrument-target interactions to support training and evaluation, and demonstrates practical generalization across surgeries with online inference capabilities.

Abstract

Surgical workflow anticipation is the task of predicting the timing of relevant surgical events from live video data, which is critical in Robotic-Assisted Surgery (RAS). Accurate predictions require the use of spatial information to model surgical interactions. However, current methods focus solely on surgical instruments, assume static interactions between instruments, and only anticipate surgical events within a fixed time horizon. To address these challenges, we propose an adaptive graph learning framework for surgical workflow anticipation based on a novel spatial representation, featuring three key innovations. First, we introduce a new representation of spatial information based on bounding boxes of surgical instruments and targets, including their detection confidence levels. These are trained on additional annotations we provide for two benchmark datasets. Second, we design an adaptive graph learning method to capture dynamic interactions. Third, we develop a multi-horizon objective that balances learning objectives for different time horizons, allowing for unconstrained predictions. Evaluations on two benchmarks reveal superior performance in short-to-mid-term anticipation, with an error reduction of approximately 3% for surgical phase anticipation and 9% for remaining surgical duration anticipation. These performance improvements demonstrate the effectiveness of our method and highlight its potential for enhancing preparation and coordination within the RAS team. This can improve surgical safety and the efficiency of operating room usage.
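
To make the first innovation concrete, here is a minimal sketch of how per-frame detections could be packed into one feature row per graph node. The node order follows the Figure 5 legend. The row layout `[cx, cy, w, h, conf]`, the normalization by frame size, the zero rows for absent objects, and the names `NODE_NAMES` and `build_node_features` are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Node order copied from the Figure 5 legend (cholecystectomy example).
NODE_NAMES = ["surgical target", "grasper", "bipolar", "hook",
              "scissors", "clipper", "irrigator", "specimen bag"]

# Hypothetical detection format: (x_min, y_min, x_max, y_max, confidence), in pixels.
Box = Tuple[float, float, float, float, float]


def build_node_features(detections: Dict[str, Box],
                        width: int, height: int) -> List[List[float]]:
    """Return one [cx, cy, w, h, conf] row per node.

    Coordinates are normalized by the frame size (an assumption).
    Objects that are not detected in the frame get an all-zero row.
    """
    features = []
    for name in NODE_NAMES:
        if name not in detections:
            features.append([0.0] * 5)
            continue
        x0, y0, x1, y1, conf = detections[name]
        features.append([
            (x0 + x1) / 2 / width,   # box centre, x
            (y0 + y1) / 2 / height,  # box centre, y
            (x1 - x0) / width,       # box width
            (y1 - y0) / height,      # box height
            conf,                    # detector confidence
        ])
    return features


frame = build_node_features({"grasper": (100, 50, 300, 250, 0.9)},
                            width=800, height=400)
```

Keeping a fixed node order means every frame yields a matrix of the same shape, so absent instruments appear as zero rows rather than changing the graph size.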

Paper Structure

This paper contains 29 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Comparison of semantic segmentation (top) with object detection (bottom) across three consecutive seconds. We compare our trained YOLOv5 model with Segment Anything [kirillov2023segment], a state-of-the-art foundation model for segmentation. Segmentation masks change significantly across frames even when the objects' positions remain static. In contrast, bounding boxes consistently provide a stable representation of both location and size.
  • Figure 2: Overview of our method. Given a sequence of video frames as input, our model has three main processing stages (Sections \ref{sec:gi}--\ref{sec:aml}). (1) From the raw frames, we extract bounding boxes of surgical instruments and targets. (2) This information is further processed using adaptive graphs by (2.A) selecting a number of candidate graphs (red, yellow, green); (2.B1) using graph convolution to process the node features based on the graph's connectivity; (2.B2) fusing nodes from the multiple candidate graphs; and (2.B3) performing temporal convolution over the nodes from different video frames. (3) The final node features are used to produce an unconstrained prediction of various surgical events, trained using a multi-horizon objective.
  • Figure 3: Additional annotation for existing datasets. For the Cholec80 dataset, we provide additional annotations that focus on surgical targets. For the Cataract101 dataset, we provide additional annotations that focus on surgical targets and surgical instruments.
  • Figure 4: The architecture of our adaptive graph learning consists of two main components. Left: Candidate Graph Selection selects suitable graph representations for each frame from the most common interactions observed in the training data. Right: Graph-based Feature Learning transforms spatial information into spatio-temporal features based on the graphs selected for each frame, and then maps these features to the final anticipation output.
  • Figure 5: Example of object detection results and graph representation from a cholecystectomy [twinanda2016endonet]. Left: A frame showing the grasper and hook dissecting the tissue plane. Right: Fully connected candidate graph representing interactions among instruments and surgical targets. Gray nodes represent objects that do not appear in the frame. Node legend: 0: surgical target; 1: grasper; 2: bipolar; 3: hook; 4: scissors; 5: clipper; 6: irrigator; 7: specimen bag.
  • ...and 7 more figures
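
Steps (2.B1) to (2.B3) in the Figure 2 caption can be illustrated with a small dependency-free sketch. It is a simplified stand-in, not the paper's architecture: the row normalization with self-loops, the mean used for fusion, the causal temporal kernel, and all function names are assumptions chosen to keep the example short.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_rows(adj: Matrix) -> Matrix:
    """Add self-loops, then divide each row by its sum."""
    n = len(adj)
    out = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        out.append([v / total for v in row])
    return out


def graph_conv(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """(2.B1) Aggregate neighbour features, apply a linear map, then ReLU."""
    mixed = matmul(matmul(normalize_rows(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in mixed]


def fuse(per_graph: List[Matrix]) -> Matrix:
    """(2.B2) Average node features across the candidate graphs."""
    k = len(per_graph)
    return [[sum(vals) / k for vals in zip(*rows)]
            for rows in zip(*per_graph)]


def temporal_conv(frames: List[Matrix], kernel: List[float]) -> List[Matrix]:
    """(2.B3) Causal 1D convolution over time, per node and per channel.

    kernel[0] weights the current frame, kernel[1] the previous one, etc.
    Only past frames are used, which suits online inference.
    """
    out = []
    for t in range(len(frames)):
        acc = [[0.0] * len(frames[0][0]) for _ in frames[0]]
        for k, w in enumerate(kernel):
            if t - k < 0:
                break
            for i, row in enumerate(frames[t - k]):
                for j, v in enumerate(row):
                    acc[i][j] += w * v
        out.append(acc)
    return out


# Two nodes, one feature channel, two candidate graphs (connected / empty).
feats = [[1.0], [3.0]]
weight = [[1.0]]
connected = graph_conv([[0.0, 1.0], [1.0, 0.0]], feats, weight)
isolated = graph_conv([[0.0, 0.0], [0.0, 0.0]], feats, weight)
fused = fuse([connected, isolated])
smoothed = temporal_conv([[[1.0]], [[2.0]]], kernel=[0.5, 0.5])
```

In the toy run, the connected graph mixes the two nodes' features while the empty graph leaves them unchanged, and fusion averages the two views. A learned model would replace the fixed `weight` and `kernel` with trained parameters.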