End to End AI System for Surgical Gesture Sequence Recognition and Clinical Outcome Prediction

Xi Li, Nicholas Matsumoto, Ujjwal Pasupulety, Atharva Deo, Cherine Yang, Jay Moran, Miguel E. Hernandez, Peter Wager, Jasmine Lin, Jeanine Kim, Alvin C. Goh, Christian Wagner, Geoffrey A. Sonn, Andrew J. Hung

TL;DR

Frame-to-Outcome (F2O) presents an end-to-end framework for translating intraoperative tissue-dissection videos into gesture sequences and linking them to postoperative outcomes. By combining transformer-based spatiotemporal modeling with frame-wise gesture classification and change-point aggregation, F2O achieves robust frame-level (AUC ≈ $0.80$) and video-level (AUC ≈ $0.81$) gesture recognition, while producing interpretable gesture-sequence features that predict erectile function recovery with accuracy ≈ $0.79$, matching or exceeding human-annotated baselines. Across 25 overlapping outcome-significant features, F2O and ground-truth signals exhibit near-identical effect directions and a strong correlation ($r \,=\, 0.96$, $p<1\times10^{-14}$), with meaningful patterns such as prolonged peel/spread gestures correlating with better outcomes and excessive energy use with poorer outcomes. The method generalizes across transformer backbones, remains data-efficient (effective with as little as 10% of data), and maintains performance under varying architectural and training choices, supporting scalable deployment for automated surgical analytics and prospective decision support. Overall, F2O provides a data-driven, interpretable bridge from fine-grained intraoperative actions to patient outcomes, enabling real-time feedback, automated annotation, and cross-domain surgical analytics.

Abstract

Fine-grained analysis of intraoperative behavior and its impact on patient outcomes remains a longstanding challenge. We present Frame-to-Outcome (F2O), an end-to-end system that translates tissue dissection videos into gesture sequences and uncovers patterns associated with postoperative outcomes. Leveraging transformer-based spatial and temporal modeling and frame-wise classification, F2O robustly detects consecutive short (~2 seconds) gestures in the nerve-sparing step of robot-assisted radical prostatectomy (AUC: 0.80 frame-level; 0.81 video-level). F2O-derived features (gesture frequency, duration, and transitions) predicted postoperative outcomes with accuracy comparable to human annotations (0.79 vs. 0.75; overlapping 95% CI). Across 25 shared features, effect-size directions were concordant, with small differences (~0.07) and a strong correlation (r = 0.96, p < 1e-14). F2O also captured key patterns linked to erectile function recovery, including prolonged tissue peeling and reduced energy use. By enabling automatic interpretable assessment, F2O establishes a foundation for data-driven surgical feedback and prospective clinical decision support.
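
To make the sequence features named above (gesture frequency, duration, and transitions) concrete, the following is a minimal Python sketch of how they could be computed from a frame-wise label sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names `frames_to_segments` and `sequence_features`, the `fps` parameter, and the toy input are hypothetical, and consecutive identical labels are simply collapsed into segments, which stands in for the change-point aggregation mentioned in the TL;DR. The single-letter codes follow the Figure 2 caption (p = peel, s = spread, c = cold cut).

```python
from collections import Counter
from itertools import groupby


def frames_to_segments(frame_labels, fps):
    """Collapse runs of identical frame labels into (gesture, duration_s) segments.

    Assumption: simple run-length collapsing, used here as a stand-in for
    the paper's change-point aggregation.
    """
    return [(label, sum(1 for _ in run) / fps) for label, run in groupby(frame_labels)]


def sequence_features(segments):
    """Return per-gesture counts, mean durations, and transition counts."""
    counts = Counter(label for label, _ in segments)

    total_duration = Counter()
    for label, duration in segments:
        total_duration[label] += duration
    mean_duration = {label: total_duration[label] / counts[label] for label in counts}

    # A transition is an ordered pair of adjacent segments (from_gesture, to_gesture).
    transitions = Counter((a, b) for (a, _), (b, _) in zip(segments, segments[1:]))

    return {
        "frequency": dict(counts),
        "mean_duration_s": mean_duration,
        "transitions": dict(transitions),
    }


# Hypothetical toy input: one label per frame at 1 frame per second.
frame_labels = list("ppppssccpp")
segments = frames_to_segments(frame_labels, fps=1)
features = sequence_features(segments)
```

In this toy case the ten frames collapse into four segments (peel, spread, cold cut, peel), so peel has a frequency of 2 and a mean duration of 3 seconds. In practice the per-frame labels would come from the classifier's gesture probabilities, for example by taking the most probable class per frame.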

Paper Structure

This paper contains 31 sections, 5 equations, 5 figures.

Figures (5)

  • Figure 1: Frame-to-Outcome (F2O) is an end-to-end AI system for surgical gesture sequence recognition and clinical outcome analysis. a, Surgical videos, such as those from the nerve-sparing (NS) step of robot-assisted radical prostatectomy (RARP), are annotated by trained human raters who identify over ten dominant gesture classes and label the start and end times of each gesture. Each video typically contains ~270 gestures over a 10-minute duration, with an average gesture lasting 2 seconds. Frame-to-Outcome (F2O) automates the recognition of these fine-grained gestures and enables downstream clinical outcome analysis. b, The system processes untrimmed tissue dissection videos and outputs a sequence of standardized surgical gesture probabilities by combining spatial and temporal modeling with frame-wise classification. Specifically, it processes sequences of 16 frames, leveraging spatial and temporal neighbors (red and green) to compute self-attention for each target patch (blue). These context-aware embeddings are then passed through a frame-wise classifier, which produces gesture probability distributions for each frame based on the aggregated representations. c, Sequence-based feature engineering is then applied to identify relationships with clinical outcomes, and results are evaluated through concordance analysis, including both feature-level and model-level concordance.
  • Figure 2: F2O accurately classifies frames across classes and videos. a, Frame-level performance across gesture classes of cold cut (c), hook (h), clip (k), camera move (m), peel (p), retraction (r), spread (s), assistant (a), coagulation (g), and energy cut (e), evaluated over five randomized data splits. b, Temporal alignment between model-predicted gesture probabilities and ground truth when applied frame-by-frame to the entire video. c, Video-level performance across 29 full-length videos in the test set, showing the distribution of AUC values.
  • Figure 3: Sequence-based features.
  • Figure 4: Effect size comparisons between the overlapping significant features derived from F2O and from ground-truth gestures. A positive effect size indicates that the feature is associated with a better erectile function (EF) outcome. All features show the same direction and representation of populations.
  • Figure 5: Experiments across backbones and data scales showed consistent performance. a, Comparison of model performance across different transformer backbones, illustrating architectural flexibility. b, Impact of key design components on classification performance including the frame classifier architecture (Temporal Transformer Encoder), pretraining weights (Kinetics-400), and backbone optimization methods (Partial unfreezing, Low-Rank Adaptation), measured as relative changes from the baseline system. c, Model performance across increasing training data volumes, demonstrating robustness in low-data scenarios and scalability to new sites.