DACAT: Dual-stream Adaptive Clip-aware Time Modeling for Robust Online Surgical Phase Recognition
Kaixiang Yang, Qiang Li, Zhiwei Wang
TL;DR
DACAT introduces a dual-stream framework for online surgical phase recognition that fuses frame-wise embeddings with adaptive clip-aware context. A parameter-free Max-R read-out uses the current frame feature to retrieve the most relevant past clip from a precomputed feature cache, and cross-attention merges this clip-aware context with the current frame feature to improve temporal modeling. Experiments on Cholec80, M2CAI16, and AutoLaparo show that DACAT improves Jaccard scores over existing state-of-the-art methods by at least 4.5%, 4.6%, and 2.7%, respectively, with online inference reaching 38.1 fps. Ablation studies confirm the complementary value of the two branches, the effectiveness of the adaptive clip read-out, and the benefit of fine-tuning the cache. The work also outlines directions for further reducing interference from challenging frames.
Abstract
Surgical phase recognition has become a crucial requirement in laparoscopic surgery, enabling clinical applications such as surgical risk forecasting. Current methods typically identify the surgical phase using individual frame-wise embeddings as the fundamental unit for time modeling. However, this approach is overly sensitive to the current observation, often resulting in discontinuous and erroneous predictions within a single surgical phase. In this paper, we propose DACAT, a novel dual-stream model that adaptively learns clip-aware context information to strengthen temporal modeling. In one stream, DACAT pretrains a frame encoder and caches all historical frame-wise features. In the other stream, DACAT fine-tunes a new frame encoder to extract the frame-wise feature at the current moment. A max clip-response read-out (Max-R) module bridges the two streams by using the current frame-wise feature to adaptively fetch the most relevant past clip from the feature cache. The clip-aware context feature is then encoded via cross-attention between the current frame and its fetched adaptive clip, and is further used to enhance the time modeling for accurate online surgical phase recognition. Benchmark results on three public datasets, i.e., Cholec80, M2CAI16, and AutoLaparo, demonstrate the superiority of DACAT over existing state-of-the-art methods, with improvements in Jaccard scores of at least 4.5%, 4.6%, and 2.7%, respectively. Our code and models have been released at https://github.com/kk42yy/DACAT.
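The read-out and fusion steps described in the abstract can be illustrated with a small NumPy sketch. This is not the released implementation. It makes several assumptions that the abstract does not state: the "clip response" is taken to be the summed cosine similarity between the current feature and a contiguous clip ending at the most recent cached frame, the cross-attention is single-head with no learned projections, and the function names `max_r_readout` and `cross_attention` are hypothetical.

```python
import numpy as np


def max_r_readout(cache: np.ndarray, current: np.ndarray) -> int:
    """Return the start index s of the clip cache[s:] with the largest
    summed cosine similarity to the current feature.

    Assumption: this scoring rule is one plausible parameter-free
    read-out, not necessarily the one used by the authors.
    """
    eps = 1e-8
    cache_n = cache / (np.linalg.norm(cache, axis=1, keepdims=True) + eps)
    cur_n = current / (np.linalg.norm(current) + eps)
    sim = cache_n @ cur_n  # per-frame response, shape (T,)
    # suffix[s] = sim[s] + sim[s+1] + ... + sim[T-1]
    suffix = np.cumsum(sim[::-1])[::-1]
    return int(np.argmax(suffix))


def cross_attention(query: np.ndarray, clip: np.ndarray) -> np.ndarray:
    """Single-head attention with the current frame as the query and the
    fetched clip as keys and values (no learned projections, an assumption)."""
    scores = clip @ query / np.sqrt(query.shape[-1])  # shape (L,)
    weights = np.exp(scores - scores.max())           # stable softmax
    weights /= weights.sum()
    return weights @ clip                             # shape (D,)


# Toy cache: three frames from an unrelated phase, then three similar frames.
cache = np.array([[-1.0, 0.0]] * 3 + [[1.0, 0.0]] * 3)
current = np.array([1.0, 0.0])

start = max_r_readout(cache, current)            # clip is cache[start:]
context = cross_attention(current, cache[start:])
fused = np.concatenate([current, context])       # input to later time modeling
```

In this toy case the selected clip begins at the first frame that resembles the current one, so the dissimilar earlier frames are excluded from the context. A real system would feed `fused` (or a learned combination of the two features) into the temporal model.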
