Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation

Firdavs Nasriddinov, Rafal Kocielnik, Anima Anandkumar, Andrew J. Hung

TL;DR

This work introduces a structure-aware pipeline that grounds natural-language surgical feedback in Instrument–Action–Tissue (IAT) triplets derived from real trainer–trainee transcripts. The system fuses video frames, temporal instrument motion, and procedure/task context to predict IAT triplets, which then condition GPT-4o to generate trainer-style, clinically grounded feedback; an uncertainty gate reduces hallucinations. On Task 1 (Video→IAT) and Task 2 (Feedback Generation), adding IAT structure and motion context improves IAT recognition (higher AUC) and the fidelity of generated feedback (higher clinician-aligned scores, lower word error rate (WER), higher ROUGE), as measured with a clinician-aligned evaluation protocol. Because the generated content is grounded in interpretable IAT semantics, the approach supports auditable, scalable surgical coaching and offers a data-efficient representation for clinical training and simulation settings.
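One way to picture the uncertainty gate is as a filter that hides low-confidence IAT slots from the language model before the prompt is built. The sketch below is only an illustration under stated assumptions: the names `SlotPrediction`, `gate`, `build_prompt`, and `GATE_THRESHOLD`, the threshold value, and the prompt wording are all hypothetical and do not come from the paper, which does not describe the gate at this level of detail. No model is called; the function only assembles text.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: the source does not state the value used by the gate.
GATE_THRESHOLD = 0.5


@dataclass
class SlotPrediction:
    """One predicted IAT slot (hypothetical container, not the paper's API)."""
    label: str         # normalized tag, e.g. "energy device"
    confidence: float  # predicted probability for that tag, in [0, 1]


def gate(slot: SlotPrediction, threshold: float = GATE_THRESHOLD) -> Optional[str]:
    """Return the label only if the prediction clears the threshold."""
    return slot.label if slot.confidence >= threshold else None


def build_prompt(
    instrument: SlotPrediction,
    action: SlotPrediction,
    tissue: SlotPrediction,
    procedure: str,
    task: str,
    threshold: float = GATE_THRESHOLD,
) -> str:
    """Assemble a text prompt in which gated-out slots appear as 'unknown'."""
    slots = {
        "Instrument": gate(instrument, threshold),
        "Action": gate(action, threshold),
        "Tissue": gate(tissue, threshold),
    }
    lines = [f"Procedure: {procedure}", f"Task: {task}"]
    for name, value in slots.items():
        lines.append(f"{name}: {value if value is not None else 'unknown'}")
    lines.append(
        "Write one short piece of trainer-style feedback. "
        "Do not mention any component marked unknown."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up predictions, for illustration only.
    print(
        build_prompt(
            SlotPrediction("energy device", 0.91),
            SlotPrediction("coagulate", 0.32),
            SlotPrediction("vein", 0.78),
            procedure="example procedure",
            task="example task",
        )
    )
```

With these made-up inputs, the low-confidence action is replaced by "unknown", so the generator is not encouraged to assert an action the recognizer was unsure about.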

Abstract

High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale but requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer-to-trainee transcripts (33 surgeries) and uses it to condition feedback generation. We contribute by (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video-to-IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion, and (3) demonstrating how to effectively use IAT triplet representations to guide GPT-4o in generating clinically grounded, trainer-style feedback. We show that, on Task 1: Video-to-IAT recognition, our context injection and temporal tracking deliver consistent AUC gains (Instrument: 0.67 to 0.74; Action: 0.60 to 0.63; Tissue: 0.74 to 0.79). For Task 2: feedback text generation (rated on a 1-5 fidelity rubric where 1 = opposite/unsafe, 3 = admissible, and 5 = perfect match to a human trainer), GPT-4o from video alone scores 2.17, while IAT conditioning reaches 2.44 (+12.4%), doubling the share of admissible generations with score >= 3 from 21% to 42%. Traditional text-similarity metrics also improve: word error rate decreases by 15-31% and ROUGE (phrase/substring overlap) increases by 9-64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
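The headline numbers in the abstract can be checked with a few lines of arithmetic. The snippet below is a small sketch that uses only the figures quoted above; the helper name `relative_change` is ours, not the authors'.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Task 2: mean fidelity score, video-only GPT-4o vs. IAT-conditioned GPT-4o.
fidelity_gain = relative_change(2.17, 2.44)

# Task 2: share of admissible generations (score >= 3), in percent.
admissible_ratio = 42 / 21

# Task 1: absolute AUC gains per prediction head (baseline, with context and tracking).
auc = {
    "Instrument": (0.67, 0.74),
    "Action": (0.60, 0.63),
    "Tissue": (0.74, 0.79),
}
auc_deltas = {head: f"{after - before:+.2f}" for head, (before, after) in auc.items()}

if __name__ == "__main__":
    print(f"Fidelity gain: {fidelity_gain:.1f}%")        # Fidelity gain: 12.4%
    print(f"Admissible share: x{admissible_ratio:.1f}")  # Admissible share: x2.0
    print(auc_deltas)
```

The relative fidelity gain is (2.44 - 2.17) / 2.17, which rounds to 12.4%, and the absolute AUC gains are 0.07, 0.03, and 0.05 for the Instrument, Action, and Tissue heads respectively.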

Paper Structure

This paper contains 54 sections, 5 figures, 29 tables.

Figures (5)

  • Figure 1: Structure-aware pipeline for clinically aligned surgical feedback. (1) Multimodal inputs: video, inferred instrument motion, and procedure/task context are encoded and fused. (2) Clinically grounded representation: the fused features yield an Instrument–Action–Tissue (IAT) triplet summarizing the tool–tissue interaction. (3) Feedback generation & evaluation: predicted IATs condition an LLM to produce trainer-style feedback, assessed with a clinician-aligned fidelity rubric and standard text metrics.
  • Figure 2: Surgical ontology extraction from raw trainer feedback. (a) Triplet Extraction: GPT-4o parses free-text feedback into one or more Instrument–Action–Tissue triplets $[\mathrm{I},\mathrm{A},\mathrm{T}]$, permitting null components when unmentioned. (b) Clustering & Normalization: a reasoning LLM (Gemini 2.5) clusters semantically related surface forms for each slot (I/A/T), merges them into functionally coherent meta-clusters, and prunes low-frequency categories. (c) Label-Space Mapping: canonical tags (e.g., energy device, coagulate, vein) and raw$\rightarrow$tag mappings are produced; this normalized label space provides weak supervision for the video$\rightarrow$IAT model and conditions feedback generation.
  • Figure 3: Surgical instrument temporal motion tracking pipeline. (a) Input video frames from the endoscopic view are processed to capture tool appearance and motion. (b) A scene depth map is estimated (Depth Anything) to initialize spatially consistent point tracking. (c) Joint point tracking of instrument keypoints is performed across frames using CoTracker. (d) The resulting motion trajectories capture fine-grained temporal dynamics of instrument movement. (e) An LSTM encodes these trajectories into compact motion embeddings for downstream modeling.
  • Figure 4: Video$\rightarrow$IAT confusion matrices (combined). Per-class confusion for the three prediction heads (Instrument, Action, and Tissue/Target), computed on the validation split and visualized side-by-side in a single panel. The plots highlight characteristic confusions (e.g., left_hand vs. fourth_arm, energy actions around coagulate, and vascular classes such as general_vasculature vs. major_veins), as well as the impact of None labels. These diagnostics complement the AUC results in Table \ref{tab:video_iat_auc_gain_results} by illustrating where temporal tracking and clinical context reduce ambiguity.
  • Figure 5: IAT class frequency and elbow thresholds. Class counts for instruments, actions, and tissues (sorted by frequency). Dashed lines indicate elbow-derived cutoffs (29/9/24); the legend also reports the number of feedback lines with NONE for that triplet component.
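
The elbow-derived cutoffs in Figure 5 can be illustrated with a generic elbow heuristic on a sorted frequency curve. The sketch below is an assumption, not the authors' procedure: the captions do not say how the elbow was computed, so this uses one common choice (the point farthest from the straight line joining the first and last points of the curve). The function names `elbow_index` and `keep_frequent` and the class counts in the usage example are made up for illustration.

```python
from typing import Dict, Sequence


def elbow_index(counts: Sequence[int]) -> int:
    """Index of the elbow in a descending-sorted frequency curve.

    The elbow is taken as the point farthest from the chord that joins the
    first and last points (assumed heuristic, not taken from the paper).
    """
    ys = sorted(counts, reverse=True)
    n = len(ys)
    if n < 3:
        return max(n - 1, 0)
    x1, y0, y1 = n - 1, ys[0], ys[-1]

    def distance(i: int) -> float:
        # Unnormalized point-to-line distance; the constant denominator
        # does not affect which index is the maximum.
        return abs((y1 - y0) * i - x1 * ys[i] + x1 * y0)

    return max(range(n), key=distance)


def keep_frequent(class_counts: Dict[str, int]) -> Dict[str, int]:
    """Keep classes up to and including the elbow; drop the long tail."""
    if not class_counts:
        return {}
    items = sorted(class_counts.items(), key=lambda kv: kv[1], reverse=True)
    cut = elbow_index([count for _, count in items])
    return dict(items[: cut + 1])


if __name__ == "__main__":
    # Made-up action counts, for illustration only.
    toy_counts = {
        "grasp": 100, "cut": 60, "retract": 30, "coagulate": 10,
        "a": 8, "b": 6, "c": 5, "d": 4, "e": 3, "f": 2,
    }
    print(keep_frequent(toy_counts))
```

On the toy counts, the curve flattens after the fourth class, so the four most frequent classes are kept and the tail is pruned. Whether the cutoffs 29/9/24 in the caption are frequency thresholds or numbers of retained classes is not stated in this excerpt, so the sketch makes no claim about reproducing them.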