SR-Mamba: Effective Surgical Phase Recognition with State Space Model

Rui Cao, Jiangliu Wang, Yun-Hui Liu

TL;DR

This work tackles surgical phase recognition in long videos by modeling long-range temporal dependencies without attention-based architectures. It introduces SR-Mamba, a state-space-model-based framework with a bidirectional Mamba decoder and a single-step end-to-end training regime that jointly optimizes spatial feature extraction and temporal modeling. The approach achieves state-of-the-art results on Cholec80 and CATARACTS using a lightweight ResNet34 backbone, demonstrating strong temporal reasoning with improved efficiency. The findings emphasize the value of bidirectional temporal modeling and long-sequence handling for accurate, practical surgical workflow analysis in computer-assisted interventions.

Abstract

Surgical phase recognition is crucial for enhancing the efficiency and safety of computer-assisted interventions. One of the fundamental challenges involves modeling the long-distance temporal relationships present in surgical videos. Inspired by the recent success of Mamba, a state space model with linear scalability in sequence length, this paper presents SR-Mamba, a novel attention-free model specifically tailored to meet the challenges of surgical phase recognition. In SR-Mamba, we leverage a bidirectional Mamba decoder to effectively model the temporal context in overlong sequences. Moreover, the efficient optimization of the proposed Mamba decoder facilitates single-step neural network training, eliminating the need for separate training steps as in previous works. This single-step training approach not only simplifies the training process but also ensures higher accuracy, even with a lighter spatial feature extractor. Our SR-Mamba establishes a new benchmark in surgical video analysis by demonstrating state-of-the-art performance on the Cholec80 and CATARACTS Challenge datasets. The code is accessible at https://github.com/rcao-hk/SR-Mamba.
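To make the idea of bidirectional temporal modeling with a linear-time state space recurrence concrete, here is a minimal sketch. It is not the authors' decoder and it is not Mamba's selective scan: it assumes a scalar, time-invariant recurrence $h_t = a\,h_{t-1} + b\,x_t$, $y_t = c\,h_t$, and the function names `ssm_scan` and `bidirectional_scan`, the parameters `a`, `b`, `c`, and the choice to sum the two directions are all illustrative assumptions. It only shows how a forward pass and a backward pass over the same sequence give every time step context from both the past and the future, at a cost that grows linearly with sequence length.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Causal linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    Hypothetical, simplified stand-in for a state space layer. The
    parameters are fixed scalars here; in Mamba they are input-dependent.
    """
    h = 0.0
    ys = []
    for x in xs:  # one pass over the sequence, so cost is O(T)
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Combine a forward scan and a backward scan by summation (an assumption)."""
    forward = ssm_scan(xs, a, b, c)
    # Scan the reversed sequence, then flip the result back to the original order.
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # A single impulse in the middle of the sequence spreads to both sides.
    print(bidirectional_scan([0.0, 1.0, 0.0], a=0.5))
```

With a purely causal scan, the first time step would receive nothing from the impulse that follows it; the backward scan is what carries that later information back. The impulse position itself is seen once by each direction, so its own input is counted twice in this summed variant.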

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: (a) Previous two-step training strategy. (b) Our proposed single-step method, which jointly trains a spatial feature extractor and a temporal model.
  • Figure 2: Overview of the proposed SR-Mamba architecture for surgical phase recognition. (a) Sequential processing of surgical video frames $\mathbf{X}=\{x_{t}\}_{t=1}^{T}$ through a spatial feature extractor $\mathbf{\Phi}$. The resulting embeddings are then fed into a Bidirectional Mamba Decoder. (b) Detailed view of the Bidirectional Mamba Decoder.
  • Figure 3: Color-coded ribbon illustration of one complete surgical video from the Cholec80 dataset. The time axes have been adjusted for improved visualization.