SR-Mamba: Effective Surgical Phase Recognition with State Space Model
Rui Cao, Jiangliu Wang, Yun-Hui Liu
TL;DR
This work tackles surgical phase recognition in long videos by modeling long-range temporal dependencies without attention-based architectures. It introduces SR-Mamba, a state-space-model-based framework with a bidirectional Mamba decoder and a single-step end-to-end training regime that jointly optimizes spatial feature extraction and temporal modeling. The approach achieves state-of-the-art results on Cholec80 and CATARACTS using a lightweight ResNet34 backbone, demonstrating strong temporal reasoning with improved efficiency. The findings emphasize the value of bidirectional temporal modeling and long-sequence handling for accurate, practical surgical workflow analysis in computer-assisted interventions.
Abstract
Surgical phase recognition is crucial for enhancing the efficiency and safety of computer-assisted interventions. One of its fundamental challenges is modeling the long-range temporal dependencies present in surgical videos. Inspired by the recent success of Mamba, a state space model that scales linearly with sequence length, this paper presents SR-Mamba, a novel attention-free model tailored to surgical phase recognition. SR-Mamba uses a bidirectional Mamba decoder to model temporal context in very long sequences. Moreover, the efficient optimization of the proposed Mamba decoder enables single-step neural network training, eliminating the separate training steps required in previous works. This single-step training approach not only simplifies the training process but also yields higher accuracy, even with a lighter spatial feature extractor. SR-Mamba establishes a new benchmark in surgical video analysis by demonstrating state-of-the-art performance on the Cholec80 and CATARACTS Challenge datasets. The code is accessible at https://github.com/rcao-hk/SR-Mamba.
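
To illustrate why a bidirectional scan gives every frame access to both past and future context at a cost linear in sequence length, the following is a minimal NumPy sketch. It is not the SR-Mamba decoder: it assumes a fixed scalar linear recurrence, whereas Mamba uses learned, input-dependent parameters, and it assumes the two directions are combined by summation. The names `linear_scan` and `bidirectional_scan` and the values of `a` and `b` are hypothetical.

```python
import numpy as np

def linear_scan(x, a, b):
    """Run h_t = a * h_{t-1} + b * x_t along the time axis (one pass, linear in T)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, a=0.9, b=0.1):
    """Combine a forward scan with a scan over the time-reversed sequence.

    Summation is an assumption made for this sketch only.
    """
    fwd = linear_scan(x, a, b)
    bwd = linear_scan(x[::-1], a, b)[::-1]  # scan backwards, then restore time order
    return fwd + bwd

# Toy input: 6 frames of 2-dimensional features with a single impulse at frame 3.
feats = np.zeros((6, 2))
feats[3] = 1.0

fwd_only = linear_scan(feats, 0.9, 0.1)
ctx = bidirectional_scan(feats)
```

In this toy example the forward-only output is zero for every frame before the impulse, while the bidirectional output is non-zero there, which is the property that lets earlier frames draw on later evidence.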
