Leveraging Surgical Activity Grammar for Primary Intention Prediction in Laparoscopy Procedures

Jie Zhang; Song Zhou; Yiwei Wang; Chidan Wan; Huan Zhao; Xiong Cai; Han Ding

TL;DR

This work tackles Primary Intention (PI) recognition in laparoscopic procedures with a grammar-augmented framework that combines a top-down surgical activity grammar with bottom-up visual cues. It models surgical activities as an SP-AOG, a probabilistic context-free grammar (PCFG) with And/Or decomposition, and uses a three-stage pipeline: a DNN-based classifier produces a PI probability matrix (P*), ADIOS induces a grammar (G*) from a corpus of procedures, and the Generalized Earley Parser (GEP) parses P* under G* to infer the PI sequence (A*). On the CholecPI dataset, derived from CholecT50, grammar-augmented models consistently outperform state-of-the-art vision-only detectors on micro accuracy and weighted F1, with RDV+$\mathcal{G}_{10}$ achieving the strongest overall performance. The results demonstrate the value of hierarchical grammar for surgical workflow understanding, support improved planning and automation in robotic surgery, and suggest combining grammar with advanced planning tools, including potential LLM-guided robot planning.
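The parsing stage can be illustrated with a small sketch. It is not the Generalized Earley Parser: it assumes the grammar's language is a short, explicitly listed set of PI sequences, treats the grammar as a hard constraint with no prior, and finds the best frame-to-PI alignment for each sequence by dynamic programming. The names `align`, `decode`, `P_TOY`, and `LANGUAGE_TOY`, and all numbers, are hypothetical.

```python
import math

NEG_INF = float("-inf")

def align(log_p, sentence):
    """Split T frames into contiguous, non-empty segments that follow
    `sentence` in order. Returns (best log score, per-frame labels)."""
    T, K = len(log_p), len(sentence)
    dp = [[NEG_INF] * T for _ in range(K)]        # dp[k][t]: best score ending in segment k at frame t
    entered = [[False] * T for _ in range(K)]     # True if segment k starts at frame t
    dp[0][0] = log_p[0][sentence[0]]
    for t in range(1, T):
        for k in range(K):
            stay = dp[k][t - 1]
            move = dp[k - 1][t - 1] if k > 0 else NEG_INF
            entered[k][t] = move > stay
            dp[k][t] = max(stay, move) + log_p[t][sentence[k]]
    if dp[K - 1][T - 1] == NEG_INF:               # sentence longer than the video
        return NEG_INF, None
    labels, k = [], K - 1
    for t in range(T - 1, -1, -1):                # backtrack from the last frame
        labels.append(sentence[k])
        if entered[k][t]:
            k -= 1
    return dp[K - 1][T - 1], labels[::-1]

def decode(prob, language):
    """Pick the sentence in `language` whose best alignment scores highest.
    Assumes every probability is strictly positive."""
    log_p = [[math.log(p) for p in row] for row in prob]
    return max((align(log_p, s) for s in language), key=lambda r: r[0])

# Hypothetical probability matrix: 5 frames x 3 PI classes.
P_TOY = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.20, 0.35, 0.45],   # noisy frame: class 2 narrowly wins frame-wise
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
]
# Hypothetical language: the only PI orders the toy grammar allows.
LANGUAGE_TOY = [(0, 1), (0, 1, 2)]

framewise = [row.index(max(row)) for row in P_TOY]   # [0, 0, 2, 1, 1]
score, labels = decode(P_TOY, LANGUAGE_TOY)          # labels: [0, 0, 1, 1, 1]
```

In this toy case the frame-wise argmax produces the order 0, 2, 1, which the language does not allow, while the constrained decode relabels the noisy third frame. A real parser avoids enumerating the language, which is what makes a prefix-based method such as GEP necessary for larger grammars.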

Abstract

Surgical procedures are inherently complex and dynamic, with intricate dependencies and various execution paths. Accurate identification of the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial to understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is based on a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, utilizing the surgical activity grammar, processes visual data obtained from laparoscopic images through surgical action detectors, ensuring a more precise interpretation of the visual information. Experimental results on the benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.

Paper Structure (21 sections, 12 equations, 4 figures, 2 tables)

This paper contains 21 sections, 12 equations, 4 figures, 2 tables.

Introduction
Related works
Surgical Action recognition
Activity grammar
Methodology
Preliminaries of activity grammar
Involving grammar into classification model
Primary Intention Recognition
Probability matrix acquisition
SP-AOG learning
Primary Intention parse via GEP
Experiments and results
Experimental setup
Dataset
Metric
...and 6 more sections

Figures (4)

Figure 1: Top: the research target from broader action triplets <tool, verb, target> (left) to more specific primary intentions (center), and to the proposed primary intentions (PIs) (right). Bottom: recognition methods from using each single frame (left) to integrating continuous verb information (center), and to the proposed top-down method based on the surgical activity grammar (right).
Figure 2: The proposed framework for PI recognition. (a) The surgical activity grammar, SP-AOG, is developed by statistical learning from a corpus of surgical procedure recordings, highlighting the hierarchical relationships and dependencies among PIs. (b) SP-AOG is then used to parse a probability matrix, generated by a classification model, that indicates the likelihood of each PI category. (c) The parsing process identifies the optimal sequence of PIs that aligns with the grammar and predicted probabilities.
Figure 3: SP-AOG learned from a corpus of 10 surgeries. The pink nodes represent And-nodes, while the purple nodes signify Or-nodes. The numbers on the branching edges of Or-nodes indicate branching probability, and the bracketed numbers on And-node edges denote the order of expansion.
Figure 4: Visualization of PI recognition results from baseline models refined using the surgical activity grammar. A grammar is induced to optimize Triplet, RDV, and RiT predictions for each PI class across sequential frames. The highest-probability column in the probability matrix (columns represent PI0-PI6) indicates the PI category predicted by each baseline model. White boxes highlight the PI categories that have been refined based on the surgical grammar.
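The And-Or structure described in the Figure 3 caption can be sketched as a small data structure. The grammar below is a hypothetical toy, not the learned SP-AOG: And-nodes expand their children in the listed order, Or-nodes choose one child with the given branching probability, and the graph is assumed to be acyclic so that its language can be enumerated. The names `TOY_AOG` and `expand` are illustrative only.

```python
# Hypothetical toy And-Or graph. Symbols not listed here are terminals (PIs).
TOY_AOG = {
    "S": ("and", ["PI0", "A", "PI2"]),        # ordered expansion
    "A": ("or", [("PI1", 0.7), ("B", 0.3)]),  # branching probabilities
    "B": ("and", ["PI1", "PI3"]),
}

def expand(symbol, grammar=TOY_AOG):
    """Enumerate (PI sequence, probability) pairs derivable from `symbol`.
    Assumes the graph has no cycles."""
    if symbol not in grammar:                 # terminal PI
        return [((symbol,), 1.0)]
    kind, children = grammar[symbol]
    if kind == "or":
        # One child is chosen; its probability scales every derivation below it.
        return [(seq, p * w)
                for child, w in children
                for seq, p in expand(child, grammar)]
    # And-node: concatenate the derivations of all children, in order.
    results = [((), 1.0)]
    for child in children:
        results = [(s1 + s2, p1 * p2)
                   for s1, p1 in results
                   for s2, p2 in expand(child, grammar)]
    return results

sentences = expand("S")
# [(('PI0', 'PI1', 'PI2'), 0.7), (('PI0', 'PI1', 'PI3', 'PI2'), 0.3)]
```

Each enumerated sequence is one execution path the toy grammar permits, and the probabilities over all paths sum to one.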
