Adaptive Graph Learning from Spatial Information for Surgical Workflow Anticipation
Francis Xiatian Zhang, Jingjing Deng, Robert Lieck, Hubert P. H. Shum
TL;DR
This work tackles surgical workflow anticipation from live video by introducing three core innovations: a bounding-box based spatial representation that includes detection confidence for both instruments and surgical targets, an adaptive graph learning framework that captures dynamic instrument-target interactions, and a multi-horizon objective that unifies predictions across multiple time horizons. The method demonstrates improved short- to mid-term anticipation performance on two benchmarks, with MAE reductions of approximately 3% for surgical phase anticipation and 9% for remaining surgical duration, and shows enhanced robustness to visual artifacts. By integrating stable spatial representations with adaptable relational graphs and a learnable multi-horizon loss, the approach offers more reliable, real-time anticipation that can improve preparation, coordination, and safety in Robotic-Assisted Surgery. The work also provides new annotations for instrument-target interactions to support training and evaluation, and demonstrates practical generalization across surgeries with online inference capabilities.
Abstract
Surgical workflow anticipation is the task of predicting the timing of relevant surgical events from live video data, which is critical in Robotic-Assisted Surgery (RAS). Accurate predictions require the use of spatial information to model surgical interactions. However, current methods focus solely on surgical instruments, assume static interactions between instruments, and only anticipate surgical events within a fixed time horizon. To address these challenges, we propose an adaptive graph learning framework for surgical workflow anticipation based on a novel spatial representation, featuring three key innovations. First, we introduce a new representation of spatial information based on bounding boxes of surgical instruments and targets, including their detection confidence levels. These are trained on additional annotations we provide for two benchmark datasets. Second, we design an adaptive graph learning method to capture dynamic interactions. Third, we develop a multi-horizon objective that balances learning objectives for different time horizons, allowing for unconstrained predictions. Evaluations on two benchmarks reveal superior performance in short-to-mid-term anticipation, with an error reduction of approximately 3% for surgical phase anticipation and 9% for remaining surgical duration anticipation. These performance improvements demonstrate the effectiveness of our method and highlight its potential for enhancing preparation and coordination within the RAS team. This can improve surgical safety and the efficiency of operating room usage.
