Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin Representation

Hao Ding, Yuqian Zhang, Wenzheng Cheng, Xinyu Wang, Xu Lian, Chenhao Yu, Hongchao Shu, Ji Woong Kim, Axel Krieger, Mathias Unberath

TL;DR

The paper addresses the fragility of surgical phase recognition (SPR) models under domain shifts and corruptions by introducing a digital twin (DT) representation that decouples low-level visual processing from high-level SPR. It combines SAM2 segmentation and DepthAnything depth estimation to build DT tokens that replace raw video inputs in a SurgFormer-based SPR backbone (DT Former). The authors demonstrate enhanced robustness on corrupted and out-of-distribution data (e.g., CRCD and a robotic dVRK dataset) and show additional gains when using DT representations as augmentation to raw inputs. This DT-driven approach offers a path toward more reliable, potentially interpretable SPR systems and could accelerate clinical translation by mitigating non-causal learning and improving generalization.

Abstract

Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness due to non-causal associations in the training set. Our goal is to improve model robustness to variations in the surgical videos by leveraging the digital twin (DT) paradigm -- an intermediary layer to separate high-level analysis (SPR) from low-level processing. As a proof of concept, we present a DT representation-based framework for SPR from videos. The framework employs vision foundation models with reliable low-level scene understanding to craft DT representation. We embed the DT representation in place of raw video inputs in the state-of-the-art SPR model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. Contrary to the vulnerability of the baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 80.3 on a highly corrupted Cholec80 test set, 67.9 on the challenging CRCD dataset, and 99.8 on an internal robotic surgery dataset, outperforming the baseline by 3.9, 16.8, and 90.9 respectively. We also find that using DT representation as an augmentation to the raw input can significantly improve model robustness. Our findings lend support to the thesis that DT representations are effective in enhancing model robustness. Future work will seek to improve the feature informativeness and incorporate interpretability for a more comprehensive framework.
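To make the idea of a DT representation concrete, the following is a minimal sketch of how per-frame segmentation and depth outputs could be combined into a single array that replaces the raw RGB frame. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_dt_frame`, the one-hot-plus-depth channel layout, the min-max depth normalization, and the synthetic inputs are all hypothetical. In a real pipeline the label map and depth map would come from the segmentation and depth foundation models; here random arrays stand in for them.

```python
import numpy as np


def build_dt_frame(label_map: np.ndarray, depth: np.ndarray, num_classes: int) -> np.ndarray:
    """Stack one-hot segmentation masks with a normalized depth map.

    label_map: (H, W) integer class ids, standing in for a segmentation model's output.
    depth:     (H, W) float depth values, standing in for a depth model's output.
    Returns a (num_classes + 1, H, W) float32 array (hypothetical layout).
    """
    if label_map.shape != depth.shape:
        raise ValueError("label_map and depth must share the same (H, W) shape")
    # One-hot encode: channel c is 1 where the pixel belongs to class c.
    class_ids = np.arange(num_classes)[:, None, None]
    masks = (label_map[None, :, :] == class_ids).astype(np.float32)
    # Min-max normalize depth to [0, 1]; guard against a constant depth map.
    d_min, d_max = float(depth.min()), float(depth.max())
    scale = d_max - d_min
    depth_norm = (depth - d_min) / scale if scale > 0 else np.zeros_like(depth)
    return np.concatenate([masks, depth_norm[None].astype(np.float32)], axis=0)


# Synthetic stand-ins for model outputs on an 8-frame clip (assumed sizes).
rng = np.random.default_rng(0)
clip = np.stack([
    build_dt_frame(rng.integers(0, 3, size=(4, 6)), rng.random((4, 6)), num_classes=3)
    for _ in range(8)
])
print(clip.shape)  # (8, 4, 4, 6): frames, channels, height, width
```

Because the downstream phase recognizer sees only masks and depth in a layout like this, changes in color, lighting, or image noise can affect it only through the foundation models' outputs, which is the separation of low-level processing from high-level analysis that the abstract describes.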

Paper Structure

This paper contains 12 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of the DT paradigm. The DT paradigm provides a clear separation between low-level processing and high-level analysis, with the latter operating on the DT representation.
  • Figure 2: Illustration of the surgical phase recognition framework via DT representation.
  • Figure 3: Visual examples of the original and corrupted Cholec80 [TwinandaSMMMP17endonet], CRCD [koh2024crcd], and the robotics training dataset.