SANGRIA: Surgical Video Scene Graph Optimization for Surgical Workflow Prediction

Çağhan Köksal, Ghazal Ghazaei, Felix Holm, Azade Farshad, Nassir Navab

TL;DR

This work introduces an end-to-end framework that generates surgical scene graphs and optimizes them on a downstream task. It leverages the flexibility of graph-based spectral clustering and the generalization capability of foundation models to generate unsupervised scene graphs with learnable properties.

Abstract

Graph-based holistic scene representations facilitate surgical workflow understanding and have recently demonstrated significant success. However, this task is often hindered by the limited availability of densely annotated surgical scene data. In this work, we introduce an end-to-end framework for the generation and optimization of surgical scene graphs on a downstream task. Our approach leverages the flexibility of graph-based spectral clustering and the generalization capability of foundation models to generate unsupervised scene graphs with learnable properties. We reinforce the initial spatial graph with sparse temporal connections using local matches between consecutive frames to predict temporally consistent clusters across a temporal neighborhood. By jointly optimizing the spatiotemporal relations and node features of the dynamic scene graph with the downstream task of phase segmentation, we address the costly and annotation-burdensome task of semantic scene comprehension and scene graph generation in surgical videos using only weak surgical phase labels. Further, by incorporating effective intermediate scene representation disentanglement steps within the pipeline, our solution outperforms the SOTA on the CATARACTS dataset by 8% in accuracy and 10% in F1 score in surgical workflow recognition.
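
The graph-based spectral clustering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes patch features are already available as a NumPy array (a small hand-written array stands in for foundation-model features), builds a cosine-similarity affinity matrix, and splits the patches in two using the sign of the Fiedler vector of the normalized graph Laplacian. The function names `patch_affinity` and `spectral_bipartition` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def patch_affinity(features, threshold=0.0):
    """Cosine-similarity affinity between patch features.

    Similarities at or below `threshold` are set to zero, and
    self-loops are removed. (Illustrative choice, not from the paper.)
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    affinity = normed @ normed.T
    affinity = np.where(affinity > threshold, affinity, 0.0)
    np.fill_diagonal(affinity, 0.0)
    return affinity


def spectral_bipartition(affinity):
    """Split nodes into two clusters with the Fiedler vector.

    Uses the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    and thresholds its second-smallest eigenvector at zero.
    """
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    normalized = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)


# Toy stand-in for patch features: two loosely separated groups of three patches.
features = np.array([
    [1.00, 0.10], [1.00, 0.12], [0.98, 0.10],
    [0.10, 1.00], [0.12, 1.00], [0.10, 0.98],
])
labels = spectral_bipartition(patch_affinity(features))
print(labels)
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful. A pipeline with more than two clusters would typically run k-means on several leading eigenvectors instead of thresholding one.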

Paper Structure (10 sections, 1 equation, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Our end-to-end surgical scene graph generation and workflow prediction pipeline, SANGRIA, comprising: 1) Spectral Temporal Clustering, which converts input frames into a dynamic patchified graph and leverages graph partitioning and local feature matching to produce an initial dynamic scene graph (DSG). 2) DSG Optimization, which optimizes the edge weights between clusters for the end task. 3) Task prediction, which uses a GCN-based architecture to predict surgical phases.
  • Figure 2: DSG generation: A patchified input image is fed into DINO to construct a patch-wise affinity matrix. The static patch-based input graphs for neighboring frames within a window of size $w$ are temporally linked via sparse matches provided by LightGlue [lindenberger2023lightglue]. The dynamic patch-based graph is then clustered to predict a DSG for the last frame of the window.
  • Figure 3: A) Comparison of various graph clustering setups (WS corresponds to window size). B) End-to-end optimization improves the classification of 'Primary Knife' since it plays a critical role in predicting the current phase.
  • Figure 4: Comparison of phase segmentation performance on CATARACTS [al2019cataracts] test videos. A) Ground-truth phases. B) Predictions of the best dynamic model of [holm2023dynamic]. C) Predictions of the best SANGRIA model (ours). Our model predicts surgical phases such as Irrigation/Aspiration, OVD Aspiration, and Nucleus Breaking more consistently, thanks to SANGRIA's scene representation and understanding capabilities.
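
The temporal linking described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: per-frame patch affinities are given as square NumPy arrays, and the sparse matches between consecutive frames are given as pairs of patch indices (in the paper these come from LightGlue; here they are hand-written). The function name `build_dynamic_affinity` and the `temporal_weight` parameter are hypothetical.

```python
import numpy as np


def build_dynamic_affinity(spatial_affinities, temporal_matches, temporal_weight=1.0):
    """Combine per-frame patch graphs into one dynamic graph.

    spatial_affinities: list of (P, P) arrays, one per frame in the window.
    temporal_matches: list of (M, 2) integer arrays; entry t holds pairs
        (patch index in frame t, patch index in frame t + 1).
    Returns a (w * P, w * P) symmetric affinity matrix, where w is the
    number of frames in the window.
    """
    num_frames = len(spatial_affinities)
    num_patches = spatial_affinities[0].shape[0]
    total = num_frames * num_patches
    dynamic = np.zeros((total, total))

    # Spatial edges: each frame's affinity goes on the block diagonal.
    for t, affinity in enumerate(spatial_affinities):
        start = t * num_patches
        dynamic[start:start + num_patches, start:start + num_patches] = affinity

    # Temporal edges: sparse links between matched patches of consecutive frames.
    for t, matches in enumerate(temporal_matches):
        rows = t * num_patches + matches[:, 0]
        cols = (t + 1) * num_patches + matches[:, 1]
        dynamic[rows, cols] = temporal_weight
        dynamic[cols, rows] = temporal_weight

    return dynamic


# Toy window of 3 frames with 4 patches each (random symmetric affinities).
rng = np.random.default_rng(0)
frames = []
for _ in range(3):
    a = rng.random((4, 4))
    a = (a + a.T) / 2.0
    np.fill_diagonal(a, 0.0)
    frames.append(a)

# Hypothetical matches: frame 0 -> 1 and frame 1 -> 2.
matches = [np.array([[0, 0], [2, 3]]), np.array([[1, 1]])]
dynamic = build_dynamic_affinity(frames, matches)
print(dynamic.shape)
```

The resulting matrix can then be passed to any spectral clustering routine, so that patches linked across frames tend to fall into the same cluster. Frames that are not adjacent share no direct edges in this sketch; they are connected only through intermediate frames.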