Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin Representation
Hao Ding, Yuqian Zhang, Wenzheng Cheng, Xinyu Wang, Xu Lian, Chenhao Yu, Hongchao Shu, Ji Woong Kim, Axel Krieger, Mathias Unberath
TL;DR
The paper addresses the fragility of surgical phase recognition (SPR) models under domain shifts and corruptions by introducing a digital twin (DT) representation that decouples low-level visual processing from high-level SPR. It combines SAM2 segmentation and DepthAnything depth estimation to build DT tokens that replace raw video inputs in a SurgFormer-based SPR backbone (DT Former). The authors demonstrate improved robustness on corrupted and out-of-distribution data (e.g., CRCD and a robotic dVRK dataset) and show additional gains when DT representations are used as an augmentation to raw inputs. This DT-driven approach offers a path toward more reliable, potentially interpretable SPR systems and could accelerate clinical translation by reducing reliance on non-causal associations and improving generalization.
Abstract
Surgical phase recognition (SPR) is an integral component of surgical data science, enabling high-level surgical analysis. End-to-end trained neural networks that predict the surgical phase directly from videos have shown excellent performance on benchmarks. However, these models struggle with robustness because they learn non-causal associations present in the training set. Our goal is to improve model robustness to variations in surgical videos by leveraging the digital twin (DT) paradigm -- an intermediary layer that separates high-level analysis (SPR) from low-level processing. As a proof of concept, we present a DT representation-based framework for SPR from videos. The framework employs vision foundation models with reliable low-level scene understanding to craft the DT representation. We embed the DT representation in place of raw video inputs in a state-of-the-art SPR model. The framework is trained on the Cholec80 dataset and evaluated on out-of-distribution (OOD) and corrupted test samples. In contrast to the vulnerable baseline model, our framework demonstrates strong robustness on both OOD and corrupted samples, with a video-level accuracy of 80.3 on a highly corrupted Cholec80 test set, 67.9 on the challenging CRCD dataset, and 99.8 on an internal robotic surgery dataset, outperforming the baseline by 3.9, 16.8, and 90.9, respectively. We also find that using the DT representation as an augmentation to the raw input can significantly improve model robustness. Our findings lend support to the thesis that DT representations are effective in enhancing model robustness. Future work will seek to improve the informativeness of the features and incorporate interpretability for a more comprehensive framework.
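To make the idea of an intermediary DT layer concrete, the following is a minimal sketch, not the authors' implementation. It assumes a DT frame can be approximated as per-pixel class masks stacked with a normalized depth map. The functions `segment_stub` and `depth_stub` are hypothetical placeholders standing in for a segmentation model and a depth model; they do not call any real SAM2 or DepthAnything API, and `NUM_CLASSES` is an arbitrary illustrative value.

```python
import numpy as np

NUM_CLASSES = 3  # hypothetical label set, e.g. background, tissue, instrument


def segment_stub(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a segmentation model: returns an (H, W) integer label map."""
    gray = frame.mean(axis=-1)
    # Bucket brightness into NUM_CLASSES labels, purely for illustration.
    edges = np.linspace(0.0, 255.0, NUM_CLASSES + 1)[1:-1]
    return np.digitize(gray, edges)


def depth_stub(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a monocular depth model: returns an (H, W) float map."""
    h, w = frame.shape[:2]
    column = np.linspace(0.0, 1.0, h, dtype=np.float32)[:, None]
    return np.tile(column, (1, w))


def build_dt_frame(mask: np.ndarray, depth: np.ndarray,
                   num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Stack one-hot class masks and min-max normalized depth into (H, W, C+1)."""
    one_hot = np.eye(num_classes, dtype=np.float32)[mask]
    span = depth.max() - depth.min()
    depth_norm = (depth - depth.min()) / span if span > 0 else np.zeros_like(depth)
    return np.concatenate([one_hot, depth_norm[..., None]], axis=-1)


def build_dt_clip(frames: np.ndarray) -> np.ndarray:
    """Convert a (T, H, W, 3) raw clip into a (T, H, W, C+1) DT-style clip."""
    return np.stack([build_dt_frame(segment_stub(f), depth_stub(f)) for f in frames])


# Synthetic clip: 4 frames of 32x32 RGB noise in place of real surgical video.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(4, 32, 32, 3)).astype(np.uint8)
dt_clip = build_dt_clip(clip)
print(dt_clip.shape)  # (4, 32, 32, 4)
```

In this sketch the downstream phase recognizer would consume `dt_clip` instead of `clip`, so it never sees raw pixel appearance directly. How the actual framework tokenizes and embeds the DT representation is not specified in the abstract and is left out here.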
