Neural Finite-State Machines for Surgical Phase Recognition
Hao Ding, Zhongpai Gao, Benjamin Planche, Tianyu Luan, Abhishek Sharma, Meng Zheng, Ange Lou, Terrence Chen, Mathias Unberath, Ziyan Wu
TL;DR
The paper addresses surgical phase recognition (SPR), where current models tend to produce fragmented predictions and lack long-term temporal coherence. It proposes Neural Finite-State Machines (NFSM), a plug-in module that combines learnable global phase embeddings with dynamic transition tables to impose phase-to-phase structure on top of modern neural backbones. With transition-aware training and inference, NFSM improves temporal consistency and predictive accuracy. It achieves state-of-the-art results on benchmarks such as BernBypass70 and remains robust across different backbone architectures. The work highlights NFSM's adaptability and its potential for integrating procedural knowledge into deep learning. This makes long-range video analysis of surgical workflows more reliable.
Abstract
Surgical phase recognition (SPR) is crucial for applications in workflow optimization, performance evaluation, and real-time intervention guidance. However, current deep learning models often produce fragmented predictions and fail to capture the sequential nature of surgical workflows. We propose the Neural Finite-State Machine (NFSM), a novel approach that enforces temporal coherence by integrating classical state-transition priors with modern neural networks. NFSM uses learnable global state embeddings as unique phase identifiers and dynamic transition tables to model phase-to-phase progressions. Additionally, a future phase forecasting mechanism uses repeated frame padding to anticipate upcoming transitions. Implemented as a plug-and-play module, NFSM can be integrated into existing SPR pipelines without changing their core architectures. We demonstrate state-of-the-art performance across multiple benchmarks, including a significant improvement on the BernBypass70 dataset. There, video-level accuracy rises by 0.9 points, and phase-level precision, recall, F1-score, and mAP rise by 3.8, 3.1, 3.3, and 4.1 points, respectively. Ablation studies confirm the effectiveness of each component and the module's adaptability to various architectures. By unifying finite-state principles with deep learning, NFSM offers a robust path toward consistent, long-term surgical video analysis.
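The abstract does not specify how the transition tables are applied at inference time. To illustrate the general idea of a phase-to-phase transition prior suppressing fragmented per-frame predictions, the sketch below uses a classical technique instead: Viterbi decoding over per-frame phase probabilities with a fixed, forward-only transition table. This is not NFSM itself. NFSM's tables are learnable and dynamic, and it also uses global phase embeddings and future forecasting, none of which appear here. The helper names `forward_only_transitions` and `viterbi_decode`, the `stay` and `eps` parameters, and the toy probabilities are all hypothetical.

```python
import numpy as np


def forward_only_transitions(num_phases: int, stay: float = 0.9, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical fixed prior: a phase either persists or advances to the next one.

    Returns a (K, K) matrix of log transition probabilities, where entry [i, j]
    is log P(phase j at t | phase i at t-1). Backward and skip transitions get a
    tiny probability `eps` instead of zero so the logs stay finite.
    """
    trans = np.full((num_phases, num_phases), eps)
    for i in range(num_phases - 1):
        trans[i, i] = stay
        trans[i, i + 1] = 1.0 - stay
    trans[-1, -1] = 1.0  # the final phase (almost) only transitions to itself
    trans /= trans.sum(axis=1, keepdims=True)  # renormalise each row
    return np.log(trans)


def viterbi_decode(frame_log_probs: np.ndarray, log_trans: np.ndarray) -> list[int]:
    """Most likely phase sequence given per-frame scores and a transition prior.

    frame_log_probs: (T, K) per-frame log-probabilities from any SPR backbone.
    log_trans:       (K, K) log transition matrix.
    """
    num_frames, num_phases = frame_log_probs.shape
    score = frame_log_probs[0].copy()  # best log-score of paths ending in each phase
    backptr = np.zeros((num_frames, num_phases), dtype=int)
    for t in range(1, num_frames):
        # cand[i, j]: best path ending in phase i at t-1, then moving to phase j
        cand = score[:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + frame_log_probs[t]
    # Trace the best final phase back to the first frame
    path = [int(score.argmax())]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


# Toy per-frame phase probabilities (made-up numbers: 3 phases, 8 frames).
# Frame 2 "flickers" to phase 2 while the procedure is still in phase 0.
probs = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.3, 0.1, 0.6],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
])

print("per-frame argmax:", probs.argmax(axis=1).tolist())
print("decoded:         ", viterbi_decode(np.log(probs), forward_only_transitions(3)))
```

With these toy numbers, the per-frame argmax is `[0, 0, 2, 0, 1, 1, 2, 2]`. The decoder returns `[0, 0, 0, 0, 1, 1, 2, 2]`, removing the isolated flicker because the prior makes a jump from phase 0 to phase 2 and back very unlikely.

This kind of decoding is offline, since it needs the whole video. A real-time setting would require a causal variant, for example forward filtering. A learned approach such as NFSM would also replace the hand-set table with trainable parameters.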
