SuPRA: Surgical Phase Recognition and Anticipation for Intra-Operative Planning

Maxence Boels; Yang Liu; Prokar Dasgupta; Alejandro Granados; Sebastien Ourselin

TL;DR

SuPRA addresses the crucial need for intra-operative guidance by jointly recognizing the current surgical phase and predicting upcoming phases. It introduces a unified Transformer-based architecture that combines a spatial feature extractor, long-term compression, a future-generation decoder, and segment-level predictions, optimized with a multi-task loss. Evaluations on Cholec80 and AutoLaparo21 show competitive phase recognition and strong next-phase anticipation, aided by novel segment-based metrics that capture temporal dynamics. This work advances intra-operative video understanding by enabling proactive workflow planning and potential integration with real-time guidance systems.

Abstract

Intra-operative recognition of surgical phases holds significant potential for enhancing real-time contextual awareness in the operating room. However, we argue that online recognition, while beneficial, primarily lends itself to post-operative video analysis because it has limited direct impact on surgical decisions and actions during an ongoing procedure. In contrast, we contend that the prediction and anticipation of surgical phases are inherently more valuable for intra-operative assistance, as they can meaningfully influence a surgeon's immediate and long-term planning by providing foresight into future steps. To address this gap, we propose a dual approach that simultaneously recognises the current surgical phase and predicts upcoming ones, thus offering comprehensive intra-operative assistance and guidance on the expected remaining workflow. Our novel method, Surgical Phase Recognition and Anticipation (SuPRA), leverages past and current information for accurate intra-operative phase recognition while using future segments for phase prediction. This unified approach challenges conventional frameworks that treat these objectives separately. We have validated SuPRA on two established datasets, Cholec80 and AutoLaparo21, where it demonstrated state-of-the-art performance with recognition accuracies of 91.8% and 79.3%, respectively. Additionally, we introduce new segment-level evaluation metrics, namely Edit and F1 Overlap scores, and use them to assess segment classification in a way that better reflects temporal structure. In conclusion, SuPRA presents a new multi-task approach that paves the way for improved intra-operative assistance through surgical phase recognition and prediction of future events.
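The Edit and F1 Overlap scores named in the abstract are not defined in this summary. As a rough sketch, the code below assumes they follow the definitions commonly used in temporal action segmentation: a normalised Levenshtein distance over the sequence of segment labels, and an F1 score over segments matched by intersection-over-union. The function names, the `iou_threshold` parameter, and the toy label sequences are illustrative assumptions, not taken from the paper.

```python
def to_segments(labels):
    """Collapse per-frame labels into (label, start, end) runs, end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def edit_score(pred, gt):
    """Assumed definition: 100 * (1 - Levenshtein(segment labels) / longer length)."""
    p = [s[0] for s in to_segments(pred)]
    g = [s[0] for s in to_segments(gt)]
    prev = list(range(len(g) + 1))  # distance from an empty prediction
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(g)
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    longest = max(len(p), len(g), 1)
    return 100.0 * (longest - prev[len(g)]) / longest


def f1_overlap(pred, gt, iou_threshold=0.5):
    """Assumed definition: segment-level F1 where a predicted segment is a true
    positive if its best same-label ground-truth segment has IoU >= threshold
    and has not been matched already."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold and not used[best_j]:
            tp += 1
            used[best_j] = True
    fp = len(p_segs) - tp
    fn = len(g_segs) - tp
    denom = 2 * tp + fp + fn
    return 100.0 * 2 * tp / denom if denom else 0.0


# Toy example: hypothetical per-frame phase labels.
gt = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 0, 1, 1, 2, 2]  # one spurious back-and-forth switch
edit = edit_score(pred, gt)
f1 = f1_overlap(pred, gt, iou_threshold=0.5)
```

In this toy case most frames carry the correct label, yet the brief back-and-forth between phases 0 and 1 adds two extra segments, which both scores penalise. That kind of over-segmentation is what segment-level metrics expose and frame-level accuracy tends to hide.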

Paper Structure (18 sections, 1 equation, 4 figures, 3 tables)

Introduction
Related Work
Methods
Spatial Feature Extractor
Long-Term Compression
Future Generation
Frame Recognition
Segment Prediction
Training Objectives
Segment-based Evaluation Metrics
Implementation Details
Experiments and Results
Experimental Setup
Phase Recognition
Next Phase Prediction
...and 3 more sections

Figures (4)

Figure 1: Overview of our SuPRA model for joint phase recognition and prediction applied to surgical video analysis. Top: Raw video frames undergo Long-term Compression to extract Key-Features. These are then classified into phases with the Frame Recognition module. The Future Generation module decodes those features into the future embeddings which are then classified by the Segment Prediction module, yielding the predicted upcoming phases. Bottom: Video's temporal progression (x-axis) against its key-feature dimensions (y-axis). Solid vertical lines delimit individual phase segments, while the striped line indicates the current time $t$. The filled color bars represent frames with observed key-features, while striped ones are yet to be observed. The left-to-right arrows represent the max-pooling operation on the compressed features while the right-to-left arrows represent the generated future key-features from the Future Generation module.
Figure 2: SuPRA Transformer model. We use frame embeddings $F = \{f_1, \ldots, f_T\}$, extracted with a ViT backbone (omitted for simplicity), for all frames in a video $V = \{x_1, \ldots, x_T\}$. We first define an input clip as $c_t = \{f_{t-l}, \ldots, f_t\}$, where $l$ is the clip length. We then divide the clip into $l/w = n$ non-overlapping windows and define $w_t = \{f_{t-w}, \ldots, f_t\}$ as the input to the Past-Present Encoder, which captures temporal patterns within this short clip. The Sliding Window Attention of length $w$ ensures that all frames undergo Self-Attention. These frame embeddings are further refined using Cross Attention within the Past-Present Decoder, encapsulated by the encoder-decoder module. For a concise representation of salient video features, Long-term Compression applies compression-pooling to all the encoded embeddings, yielding $w$ keys $k_{t-w}, \ldots, k_{t}$. In parallel, the Future Generation module uses the Future Decoder to transform $n$ segment queries $q_1, q_2, \ldots, q_n$ into $n$ decoded future segments $s_1, \ldots, s_n$, thereby forecasting unseen future segments, i.e. phases. These generated segments are then compressed to $n$ key-segments $k_{s1}, \ldots, k_{sn}$, which guide the decoder through a supervisory next-key-segment loss, $L_{\text{next-k-segment}}$. The Segment Prediction module translates decoded key-segments into subsequent phases, denoted $\hat{y}_{1}, \ldots, \hat{y}_{n}$, along with their durations $\hat{d}_{1}, \ldots, \hat{d}_{n}$, optimising the predictions through classifier heads via the losses $L_{\text{next-phase}}$ and $L_{\text{next-duration}}$. Lastly, the Frame Recognition module integrates the processed data for the final classification $\hat{y}_{t-w}, \ldots, \hat{y}_{t}$.
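The windowing step in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes the clip holds exactly $l$ frame embeddings and that $l$ is a multiple of $w$. The function name `split_into_windows` and the use of frame indices as stand-ins for embeddings are hypothetical.

```python
def split_into_windows(clip, w):
    """Split a clip of l frame embeddings into n = l / w non-overlapping windows."""
    l = len(clip)
    if w <= 0 or l % w != 0:
        raise ValueError("clip length must be a positive multiple of w")
    return [clip[i:i + w] for i in range(0, l, w)]


# Toy clip: l = 6 "embeddings", represented here by their frame indices.
clip = list(range(6))
windows = split_into_windows(clip, w=2)  # n = 6 / 2 = 3 windows
```

Each window would then be processed by the encoder in turn. How the actual model batches or overlaps windows is not specified in the caption.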
Figure 3: Qualitative illustration of the phase recognition (top two rows) and next-phase prediction (bottom two rows) tasks for video 80 from the Cholec80 dataset. Within each pair, the top row shows the online frame classification results and the bottom row shows the annotations. We add an "end" class in our evaluation to replace the first class (pink segment).
Figure 4: Temporal Aggregation of Key-Features. The x-axis represents the sequential frames of the video, while the y-axis denotes the compressed key dimensions. For each frame, max-pooling aggregates the salient key-features, so feature prominence accumulates over time, as shown by the increasing intensity. This approach builds on the notion that a cumulative representation of dominant events and features over time improves understanding of the current surgical phase. The goal is to leverage this aggregated information to anticipate and generate future compressed representations.
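One way to read the accumulation described in the Figure 4 caption is as a running element-wise maximum over the key vectors observed so far. The sketch below illustrates that reading only. Whether the model's pooling is exactly a running maximum is an assumption, and `running_max_pool` and the toy key values are hypothetical.

```python
def running_max_pool(keys):
    """For each time step, return the element-wise max over all key vectors seen so far."""
    pooled = []
    current = None
    for k in keys:
        if current is None:
            current = list(k)
        else:
            current = [max(a, b) for a, b in zip(current, k)]
        pooled.append(current)
    return pooled


# Toy key-features: three frames, two key dimensions each.
keys = [[0.1, 0.5], [0.4, 0.2], [0.3, 0.9]]
pooled = running_max_pool(keys)
```

Under this reading, each dimension of the pooled representation can only grow or stay the same as frames arrive, which matches the increasing intensity described in the caption.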
