End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction
Xi Li, Nicholas Matsumoto, Ujjwal Pasupulety, Atharva Deo, Cherine Yang, Jay Moran, Miguel E. Hernandez, Peter Wager, Jasmine Lin, Jeanine Kim, Alvin C. Goh, Christian Wagner, Geoffrey A. Sonn, Andrew J. Hung
TL;DR
Frame-to-Outcome (F2O) is an end-to-end framework that translates intraoperative tissue-dissection videos into gesture sequences and links them to postoperative outcomes. By combining transformer-based spatiotemporal modeling with frame-wise gesture classification and change-point aggregation, F2O achieves robust frame-level (AUC ≈ 0.80) and video-level (AUC ≈ 0.81) gesture recognition, and produces interpretable gesture-sequence features that predict erectile function recovery with accuracy ≈ 0.79, comparable to human-annotated baselines. Across 25 overlapping outcome-significant features, F2O-derived and ground-truth features show near-identical effect directions and a strong correlation (r = 0.96, p < 1e-14). The shared patterns are clinically meaningful: prolonged peel/spread gestures correlate with better outcomes, and excessive energy use with poorer outcomes. The method generalizes across transformer backbones, remains data-efficient (effective with as little as 10% of the data), and maintains performance under varying architectural and training choices, supporting scalable deployment for automated surgical analytics and prospective decision support. Overall, F2O provides a data-driven, interpretable bridge from fine-grained intraoperative actions to patient outcomes, enabling real-time feedback, automated annotation, and cross-domain surgical analytics.
Abstract
Fine-grained analysis of intraoperative behavior and its impact on patient outcomes remains a longstanding challenge. We present Frame-to-Outcome (F2O), an end-to-end system that translates tissue dissection videos into gesture sequences and uncovers patterns associated with postoperative outcomes. Leveraging transformer-based spatial and temporal modeling and frame-wise classification, F2O robustly detects consecutive short (~2 seconds) gestures in the nerve-sparing step of robot-assisted radical prostatectomy (AUC: 0.80 frame-level; 0.81 video-level). F2O-derived features (gesture frequency, duration, and transitions) predicted postoperative outcomes with accuracy comparable to human annotations (0.79 vs. 0.75; overlapping 95% CI). Across 25 shared features, effect size directions were concordant, with small differences (~0.07) and a strong correlation (r = 0.96, p < 1e-14). F2O also captured key patterns linked to erectile function recovery, including prolonged tissue peeling and reduced energy use. By enabling automatic, interpretable assessment, F2O establishes a foundation for data-driven surgical feedback and prospective clinical decision support.
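To make the gesture-sequence features concrete, the following is a minimal Python sketch of how per-frame gesture labels could be collapsed into segments and summarized as frequency, duration, and transition counts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the label strings, and the `fps` value are hypothetical, and the simple run-length collapse used here stands in for the change-point aggregation, whose details are not given in this section.

```python
from collections import Counter
from itertools import groupby


def frames_to_segments(frame_labels, fps):
    """Collapse per-frame labels into (label, duration_in_seconds) runs.

    Assumption: consecutive identical labels form one gesture. This is a
    plain run-length collapse, not the paper's change-point aggregation.
    """
    return [
        (label, sum(1 for _ in run) / fps)
        for label, run in groupby(frame_labels)
    ]


def sequence_features(segments):
    """Summarize a gesture sequence as frequency, mean duration, transitions."""
    labels = [label for label, _ in segments]
    frequency = Counter(labels)

    total_duration = Counter()
    for label, duration in segments:
        total_duration[label] += duration
    mean_duration = {g: total_duration[g] / frequency[g] for g in frequency}

    # Count ordered pairs of adjacent gestures, e.g. ("peel", "spread").
    transitions = Counter(zip(labels, labels[1:]))

    return {
        "frequency": dict(frequency),
        "mean_duration": mean_duration,
        "transitions": dict(transitions),
    }


# Hypothetical frame-wise predictions at an assumed 2 frames per second.
frame_labels = ["peel"] * 4 + ["spread"] * 2 + ["peel"] * 2
segments = frames_to_segments(frame_labels, fps=2.0)
features = sequence_features(segments)
print(segments)
print(features)
```

In a pipeline of this shape, one such feature dictionary per video would then be passed to a downstream outcome classifier; the choice of classifier is outside the scope of this sketch.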
