MoSFormer: Augmenting Temporal Context with Memory of Surgery for Surgical Phase Recognition

Hao Ding, Xu Lian, Mathias Unberath

TL;DR

MoSFormer addresses the challenge of capturing long-horizon context in surgical phase recognition by augmenting a sliding-window Transformer with Memory of Surgery (MoS), which combines a long-term, semantically interpretable history with short-term visual impressions. It introduces a memory encoding and fusion mechanism, step filtering to refine the history representation, and a memory caching pipeline that stabilizes training and inference and mitigates shortcut learning. Empirical results show state-of-the-art performance on BernBypass70 (video-level accuracy 88.0; phase-level precision/recall/F1 of 70.7/68.7/66.3) and strong gains on Cholec80 and AutoLaparo compared to Surgformer. Ablation and counterfactual analyses corroborate the complementary benefits of the two memory components, and qualitative results show improved temporal consistency and procedure-level understanding. The work suggests MoS as a generalizable framework for extended temporal context in surgical video analysis.

Abstract

Surgical phase recognition from video enables various downstream applications. Transformer-based sliding-window approaches have set the state of the art by capturing rich spatial-temporal features. However, while transformers can theoretically handle arbitrary-length sequences, in practice they are limited by memory and compute constraints, resulting in fixed context windows that struggle to maintain temporal consistency across lengthy surgical procedures. This often leads to fragmented predictions and limited procedure-level understanding. To address these challenges, we propose Memory of Surgery (MoS), a framework that enriches temporal modeling by incorporating both semantically interpretable long-term surgical history and short-term impressions. MoSFormer, our enhanced transformer architecture, integrates MoS using a carefully designed encoding and fusion mechanism. We further introduce step filtering to refine the history representation and develop a memory caching pipeline to improve training and inference stability, mitigating shortcut learning and overfitting. MoSFormer demonstrates state-of-the-art performance on multiple benchmarks. On the challenging BernBypass70 benchmark, it attains 88.0 video-level accuracy and phase-level metrics of 70.7 precision, 68.7 recall, and 66.3 F1 score, outperforming its baseline by 2.1 in video-level accuracy and by 4.6 precision, 3.6 recall, and 3.8 F1 score at the phase level. Further studies confirm the individual and combined benefits of the long-term and short-term memory components through ablation and counterfactual inference. Qualitative results show improved temporal consistency. The augmented temporal context enables procedure-level understanding, paving the way for more comprehensive surgical video analysis.
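
To make the idea of augmenting a sliding window with cached memory more concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame feature vectors are already available, represents the long-term history as normalized counts of past phase predictions, and represents short-term impressions as mean-pooled summaries of recent windows. The names `run_with_memory` and `toy_classifier`, the parameters `window` and `impressions`, and the "never move backwards" rule are all hypothetical choices made for illustration; the abstract does not specify the encoding, fusion, or step filtering at this level of detail, so they are omitted here.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Features = Sequence[float]


def run_with_memory(
    frame_features: Sequence[Features],
    classify: Callable[[Sequence[Features], List[float], List[List[float]]], int],
    num_phases: int,
    window: int = 4,
    impressions: int = 2,
) -> List[int]:
    """Causal sliding-window inference with a toy long-/short-term memory cache."""
    # Long-term memory (assumption): how often each phase has been predicted so far.
    history = [0.0] * num_phases
    # Short-term memory (assumption): summaries of the most recent windows.
    short_term: Deque[List[float]] = deque(maxlen=impressions)
    predictions: List[int] = []

    for t in range(len(frame_features)):
        # Fixed-size window ending at the current frame (no future frames).
        clip = frame_features[max(0, t - window + 1): t + 1]

        # Normalize the history so the classifier sees phase proportions.
        total = sum(history) or 1.0
        normalized = [h / total for h in history]

        phase = classify(clip, normalized, list(short_term))
        predictions.append(phase)

        # Update the cache only after predicting, so memory stays strictly causal.
        history[phase] += 1.0
        dim = len(clip[0])
        short_term.append([sum(f[d] for f in clip) / len(clip) for d in range(dim)])

    return predictions


def toy_classifier(clip, history, impressions) -> int:
    """Hypothetical stand-in for a model: reads a phase index from the last
    frame's first feature, but refuses to fall back behind the latest phase
    already recorded in the long-term history."""
    candidate = int(clip[-1][0])
    latest_seen = max((i for i, h in enumerate(history) if h > 0), default=0)
    return max(candidate, latest_seen)


if __name__ == "__main__":
    # The fourth frame looks like phase 0 again; the history suppresses the flicker.
    frames = [[0.0], [0.0], [1.0], [0.0], [1.0], [2.0]]
    print(run_with_memory(frames, toy_classifier, num_phases=3))
```

In this toy setting the history turns an isolated backward jump into a consistent prediction, which is the kind of fragmentation a fixed window alone cannot rule out. Real surgical workflows can revisit phases, so a hard monotonic rule like the one above would be too strict in practice.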

Paper Structure

This paper contains 7 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of the memory-based augmentation for the current sliding window-based surgical phase recognition paradigm. Existing approaches rely on a sliding window for phase prediction, disregarding the rich temporal context in surgical videos. Our MoS-based framework captures temporal information through long-term history and short-term impressions, integrating them into existing architectures to augment temporal context in surgical video analysis.
  • Figure 2: Qualitative Comparison.