DACAT: Dual-stream Adaptive Clip-aware Time Modeling for Robust Online Surgical Phase Recognition

Kaixiang Yang, Qiang Li, Zhiwei Wang

TL;DR

DACAT introduces a dual-stream framework for online surgical phase recognition that fuses frame-wise embeddings with adaptive clip-aware context. A parameter-free Max-R read-out retrieves the most relevant past clip from a feature cache, and cross-attention merges this clip-aware context with the current frame features to improve temporal modeling. Experiments on Cholec80, M2CAI16, and AutoLaparo show that DACAT improves Jaccard scores over previous state-of-the-art methods by at least 4.5%, 4.6%, and 2.7%, respectively, with online inference reaching 38.1 fps. Ablation studies confirm the complementary value of the two branches, the effectiveness of the adaptive clip read-out, and the benefit of fine-tuning the cache. The work also outlines directions for further reducing interference from challenging frames.

Abstract

Surgical phase recognition has become a crucial requirement in laparoscopic surgery, enabling various clinical applications like surgical risk forecasting. Current methods typically identify the surgical phase using individual frame-wise embeddings as the fundamental unit for time modeling. However, this approach is overly sensitive to current observations, often resulting in discontinuous and erroneous predictions within a complete surgical phase. In this paper, we propose DACAT, a novel dual-stream model that adaptively learns clip-aware context information to enhance the temporal relationship. In one stream, DACAT pretrains a frame encoder, caching all historical frame-wise features. In the other stream, DACAT fine-tunes a new frame encoder to extract the frame-wise feature at the current moment. Additionally, a max clip-response read-out (Max-R) module is introduced to bridge the two streams by using the current frame-wise feature to adaptively fetch the most relevant past clip from the feature cache. The clip-aware context feature is then encoded via cross-attention between the current frame and its fetched adaptive clip, and further utilized to enhance the time modeling for accurate online surgical phase recognition. The benchmark results on three public datasets, i.e., Cholec80, M2CAI16, and AutoLaparo, demonstrate the superiority of our proposed DACAT over existing state-of-the-art methods, with improvements in Jaccard scores of at least 4.5%, 4.6%, and 2.7%, respectively. Our code and models have been released at https://github.com/kk42yy/DACAT.
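The read-out and cross-attention steps described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. Several details are assumptions made only for illustration: the frame response is taken to be cosine similarity, the clip response is taken to be the sum of frame responses over a contiguous clip, and the attention is single-head with no learned projections. The function names `max_clip_readout` and `cross_attend` are hypothetical.

```python
from typing import Tuple

import numpy as np


def max_clip_readout(cache: np.ndarray, query: np.ndarray) -> Tuple[int, int]:
    """Return (start, end), inclusive, of the cached clip with the largest summed response.

    cache: (T, d) array of past frame features; query: (d,) current frame feature.
    """
    # Assumed frame response: cosine similarity between the query and each cached frame.
    cache_n = cache / (np.linalg.norm(cache, axis=1, keepdims=True) + 1e-8)
    query_n = query / (np.linalg.norm(query) + 1e-8)
    responses = cache_n @ query_n  # shape (T,)

    # Assumed clip response: sum of frame responses over a contiguous clip.
    # The maximum over all clips is found with Kadane's algorithm, so no parameters are learned.
    best_sum, best_clip = -np.inf, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, r in enumerate(responses):
        if run_sum <= 0.0:
            run_sum, run_start = float(r), i  # start a new candidate clip at frame i
        else:
            run_sum += float(r)  # extend the current candidate clip
        if run_sum > best_sum:
            best_sum, best_clip = run_sum, (run_start, i)
    return best_clip


def cross_attend(query: np.ndarray, clip: np.ndarray) -> np.ndarray:
    """Single-head attention: the current frame is the query, the clip supplies keys and values."""
    scores = clip @ query / np.sqrt(query.shape[0])  # shape (L,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ clip  # clip-aware context feature, shape (d,)
```

Under these assumptions, the context returned by `cross_attend(query, cache[start:end + 1])` would then be combined with the frame-wise feature before phase classification. A trained model would normally use learned query, key, and value projections.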

Paper Structure (12 sections, 4 equations, 4 figures, 5 tables)


Figures (4)

  • Figure 1: The overall framework of DACAT, which consists of two main branches: (1) the Frame-wise Branch (FWB) and (2) the Adaptive Clip-aware Branch (ACB). FWB extracts the embedding of the current frame $x_t$. ACB obtains the most relevant clip-aware features for $x_t$ through the designed Max Clip-Response Read-out (Max-R) and cross-attention (CA). Finally, the results of FWB and ACB are combined to obtain the phase prediction. $S$, $P$, and $\mathbf{AC}(t)$ are the frame response matrix, the clip response matrix, and the adaptive clip, respectively. The response is formulated in Eq. \ref{response_eq}.
  • Figure 2: Visual comparison with the previous SOTA on Cholec80. (a) and (b) show good predictions, while (c) and (d) show relatively poor results.
  • Figure 3: Per-phase Jaccard scores of the four read-out strategies on Cholec80.
  • Figure 4: Visualization of the adaptive clip. (a) shows the largest accuracy improvement compared with the baseline (w/o ACB). (b) visualizes the adaptive clip for two frames: a successful Case 1 and a failed Case 2.
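
Because the results and Figure 3 are reported as Jaccard scores per phase, a short sketch of a plain per-phase Jaccard index (intersection over union of predicted and ground-truth frames) may help in reading them. This is only the textbook definition applied to one video. The benchmark protocols may add steps such as per-video averaging or relaxed phase boundaries, which are not modeled here, and the function name `per_phase_jaccard` is hypothetical.

```python
import numpy as np


def per_phase_jaccard(pred, gt, num_phases):
    """Return one Jaccard score per phase for a single video's frame labels."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    scores = []
    for phase in range(num_phases):
        in_pred = pred == phase
        in_gt = gt == phase
        intersection = np.sum(in_pred & in_gt)
        union = np.sum(in_pred | in_gt)
        # A phase absent from both prediction and ground truth has no defined score.
        scores.append(intersection / union if union > 0 else float("nan"))
    return scores
```

For example, with predictions `[0, 0, 1, 1, 1]` and ground truth `[0, 1, 1, 1, 1]`, phase 0 scores 1/2 and phase 1 scores 3/4.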