CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

Kaixiang Yang, Xin Li, Qiang Li, Zhiwei Wang

TL;DR

CoStoDet-DDPM introduces a collaborative training framework that blends a deterministic surgical workflow predictor with a stochastic DDPM branch to address variability across patients and procedures. By training both branches jointly and discarding the DDPM at inference, the method achieves real-time performance while leveraging stochastic representations to improve anticipation and recognition accuracy. The approach yields state-of-the-art results on Cholec80 and AutoLaparo, with substantial gains in anticipation error and Jaccard for phase recognition, and robust generalization to patient-specific variations. Ablation studies confirm the complementary roles of the two branches and offer practical insights into encoder, temporal span, and diffusion configurations. This work demonstrates that diffusion-based feature enhancement can boost clinical predictive reliability without sacrificing speed, enabling safer and more efficient intraoperative assistance.

Abstract

Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries. In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument usage. Theoretically, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing accuracy. Experiments on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and weights are available at https://github.com/kk42yy/CoStoDet-DDPM.
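For readers unfamiliar with the Jaccard score reported above, the following is a minimal sketch of the plain frame-wise definition (intersection over union per phase, averaged over phases). It is not taken from the paper's code: the function name `phase_jaccard`, the choice to average only over phases that occur in the ground truth, and the toy label sequences are all assumptions made for illustration. The paper's evaluation protocol may differ, for example in per-video averaging or boundary relaxation.

```python
def phase_jaccard(pred, truth):
    """Mean frame-wise Jaccard (IoU) over the phases present in `truth`.

    `pred` and `truth` are equal-length sequences of per-frame phase ids.
    Averaging over ground-truth phases only is an assumption of this sketch.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    if not truth:
        raise ValueError("at least one frame is required")
    scores = []
    for phase in sorted(set(truth)):
        # Frames where both sequences agree on this phase.
        inter = sum(1 for p, t in zip(pred, truth) if p == phase and t == phase)
        # Frames where either sequence shows this phase (never zero here).
        union = sum(1 for p, t in zip(pred, truth) if p == phase or t == phase)
        scores.append(inter / union)
    return sum(scores) / len(scores)


# Hypothetical 8-frame clip with three phases; one frame is mislabeled.
truth = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 2]
print(round(phase_jaccard(pred, truth), 3))  # 0.806
```

In this toy clip the single wrong frame lowers the score of both the phase it was taken from (2/3) and the phase it was assigned to (3/4), which is why Jaccard is more sensitive to boundary errors than plain frame accuracy.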

Paper Structure

This paper contains 23 sections, 11 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Different paradigms. (a) The deterministic workflow used in previous methods fails to account for individual patient variation. (b) The conditional DDPM used in natural action segmentation struggles to meet real-time requirements and lacks clinical consistency when applied to clinical settings. (c) Our proposed co-training approach combines deterministic and stochastic models, allowing the DDPM to be discarded during inference. The dashed arrows indicate gradient back-propagation.
  • Figure 2: CoStoDet-DDPM architecture consists of a Deterministic Task Branch ($\mathcal{T}$) and a Stochastic DDPM Branch ($\mathcal{D}$). Here, $k$ represents the time step in $\mathcal{D}$. During training, both branches generate outputs, while during inference, only the results from $\mathcal{T}$ are utilized to meet real-time requirements.
  • Figure 3: Collapse and Mitigation of the Dominant Pattern. In (a), without DDPM, $\mathcal{C}$ relies solely on $\mathcal{L}_{Task}$, leading to a sparse feature distribution. The model favors the dominant data, causing long-tail data to be misrepresented or poorly mapped. After incorporating DDPM, the collapse is alleviated, as shown in (b). First, the denoising imposes an additional constraint, encouraging $\mathbf{c}_t$ to aggregate toward regions that simultaneously satisfy both label mapping and denoising capabilities, as depicted in (c). This aggregation simplifies distribution boundary learning, facilitating global optimization. Second, DDPM’s denoising process implicitly embeds "navigation" information into $\mathbf{c}_t$, encoding the trajectory from noisy labels back to clean labels. This unique navigation information enables more accurate mapping and mitigates distribution disparities. As shown in (d), it is likely embedded in the minor components or local structures of $\mathbf{c}_t$. Notably, both (c) and (d) are derived from actual feature analyses.
  • Figure 4: Visualization comparison for the anticipation task. (a) shows the phase, and (b) displays the tool visualization. In each sub-figure, the top row represents the SOTA method BNPitfalls, and the bottom row represents ours.
  • Figure 5: Visualization comparison for the recognition task. (a) and (b) showcase accurate predictions, while (c) and (d) depict relatively poorer results.
  • ...and 5 more figures
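
The captions for Figures 2 and 3 refer to a DDPM time step $k$ and to a trajectory from noisy labels back to clean labels. As a rough illustration of what the noisy labels look like, the sketch below applies the standard DDPM forward process, $y_k = \sqrt{\bar{\alpha}_k}\, y_0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$, to a one-hot phase label. Everything specific here is an assumption rather than the paper's configuration: the linear beta schedule from 1e-4 to 0.02, the 1000 steps, the three-class label, and the helper names `linear_alpha_bar` and `noise_label`. The denoising network and its conditioning on $\mathbf{c}_t$ are not modeled.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_k for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_label(y0, k, alpha_bars, rng):
    """Sample y_k from q(y_k | y_0); also return the Gaussian noise used."""
    a = alpha_bars[k]
    eps = [rng.gauss(0.0, 1.0) for _ in y0]
    y_k = [math.sqrt(a) * y + math.sqrt(1.0 - a) * e for y, e in zip(y0, eps)]
    return y_k, eps


rng = random.Random(0)
alpha_bars = linear_alpha_bar(1000)
y0 = [0.0, 1.0, 0.0]  # hypothetical one-hot phase label

# At k = 0 the label is almost clean; at k = 999 it is almost pure noise.
y_early, eps_early = noise_label(y0, 0, alpha_bars, rng)
y_late, eps_late = noise_label(y0, 999, alpha_bars, rng)
print(y_early)
print(y_late)
```

During training, a denoiser would be asked to recover `eps` (or `y0`) from `y_k`, the step `k`, and the conditioning feature. Because the source states that the DDPM branch is discarded at inference, none of this sampling would run at test time.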