Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

Zhimin Shao; Jialang Xu; Danail Stoyanov; Evangelos B. Mazomenos; Yueming Jin

TL;DR

This work tackles real-time surgical error detection in robot-assisted minimally invasive surgery (RMIS) by eliminating the need for a separate gesture segmentation step and introducing an end-to-end Chain-of-Gesture (COG) prompting framework. It combines Gestural-Visual Reasoning (GVR), which injects gesture context via prompts and visual embeddings, with Multi-Scale Temporal Reasoning (MSTR), which processes slow and fast temporal dynamics through MS-TCN pathways and a prediction-consistency objective. On the JIGSAWS benchmark, the approach outperforms the state of the art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 ms on average, demonstrating the value of contextual and multi-scale temporal reasoning for surgical error detection. The method operates without gesture annotations during training, highlighting its potential for safer RMIS and enhanced surgical education, with future directions toward semantic error-type detection and remediation guidance.

Abstract

Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, limiting their performance due to reliance on accurate gesture identification. Motivated by chain-of-thought prompting in natural language processing, this letter presents a novel, real-time, end-to-end error detection framework, Chain-of-Gesture (COG) prompting, leveraging contextual information from surgical videos. It encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, the first, a Gestural-Visual Reasoning module, utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities, enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average, demonstrating the great potential of our approach in enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.
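
To make the reported quantities concrete, the sketch below computes Accuracy, F1 score, and Jaccard index for a sequence of per-frame binary labels (1 = erroneous frame) and converts a per-frame latency into throughput. It is a minimal illustration and not the authors' evaluation code: the function names `binary_frame_metrics` and `frames_per_second` are hypothetical, and the assumption that metrics are computed on the positive (error) class without averaging across classes or videos is ours, since the abstract does not state the averaging scheme.

```python
def binary_frame_metrics(y_true, y_pred):
    """Frame-level Accuracy, F1 and Jaccard for binary labels (1 = error frame).

    Assumption: metrics are computed for the positive (error) class only;
    the paper's exact averaging scheme is not given in the abstract.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if len(y_true) == 0:
        raise ValueError("label sequences must be non-empty")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # error frames found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed error frames
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal frames kept

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when there are no positives at all.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "f1": f1, "jaccard": jaccard}


def frames_per_second(ms_per_frame):
    """Convert an average per-frame latency in milliseconds to frames/second."""
    if ms_per_frame <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_frame
```

Under these assumptions, the reported 6.69 ms per frame corresponds to roughly 149 frames per second, which is the sense in which the framework can be called real-time.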

Paper Structure (16 sections, 11 equations, 4 figures, 3 tables)

INTRODUCTION
METHODS
Problem Formulation
Gestural-Visual Reasoning (GVR)
Multi-Scale Temporal Reasoning (MSTR)
Prediction Consistency across Multi Scales
Experiments
Datasets and Evaluation Metrics
Implementation Details
Comparison with State-of-the-Art
Ablation Studies
Effectiveness of Key Components
Length of Sequence in GVR
Number of Stages in MSTR
Visual Results
...and 1 more section

Figures (4)

Figure 1: Illustration of previous methods and our proposed Chain-of-Gesture prompting. (a) Previous methods detect errors with two separate parts: gesture recognition and error detection for each type of gesture. (b) We propose an end-to-end Chain-of-Gesture prompting framework to capture complex visual reasoning processes with two reasoning modules: Gestural-Visual Reasoning and Multi-Scale Temporal Reasoning.
Figure 2: Overview of our proposed Chain-of-Gesture. (a) Gestural-visual reasoning module with gesture prompts and visual embedding, a transformer layer, and an attention layer for gestural prompting. (b) Multi-scale temporal reasoning module with a slow path and a fast path is optimized by prediction consistency. (c) Temporal Convolutional Network (TCN) in detail. (d) Downsampling in detail.
Figure 3: Analysis of length of sequence $n$ used in GVR. We show the results of the F1 score, Accuracy, and Jaccard of models with different $n$.
Figure 4: Color-coded ribbon illustration for a suturing video clip.
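
The slow and fast paths of Figure 2(b) and the downsampling of Figure 2(d) can be illustrated with a toy example on a sequence of per-frame error probabilities. This is a sketch under stated assumptions, not the paper's implementation: the helper names, the stride-based downsampling, the nearest-neighbour upsampling, and the mean-squared form of the consistency term are all hypothetical choices made here for illustration, and the real model applies multi-stage TCNs on each path rather than operating on raw probabilities.

```python
def downsample(seq, factor):
    """Keep every `factor`-th element, giving the slow path a coarser view."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return seq[::factor]


def upsample_nearest(seq, factor, length):
    """Repeat each slow-path value `factor` times, then trim to `length`
    so the result aligns frame-by-frame with the fast path."""
    out = []
    for value in seq:
        out.extend([value] * factor)
    return out[:length]


def consistency_loss(fast_probs, slow_probs):
    """Mean squared difference between two aligned prediction sequences.

    Assumption: a mean-squared penalty stands in for the paper's
    prediction-consistency objective, whose exact form is not given here.
    """
    if len(fast_probs) != len(slow_probs):
        raise ValueError("sequences must be aligned to the same length")
    if len(fast_probs) == 0:
        raise ValueError("sequences must be non-empty")
    total = sum((f - s) ** 2 for f, s in zip(fast_probs, slow_probs))
    return total / len(fast_probs)


# Toy per-frame error probabilities at the full (fast) frame rate.
fast = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

# Slow path: view the same clip at half the temporal resolution,
# then bring its predictions back to the fast path's length.
slow = downsample(fast, factor=2)
slow_aligned = upsample_nearest(slow, factor=2, length=len(fast))

# The two paths disagree only where the coarse view misses a transition.
loss = consistency_loss(fast, slow_aligned)
```

In this toy case the slow path misses the final transition back to a normal frame, so the consistency term is nonzero; penalizing such disagreement is one way to encourage predictions that agree across temporal scales.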
