CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
Yuxin He, An Li, Cheng Xue
TL;DR
The paper tackles domain generalization for surgical phase recognition under sim-to-real shifts, where VR-like training data differ from real clinical scenes. It proposes CauCLIP, a causality-inspired vision-language framework built on CLIP that uses frequency-domain augmentation and a causal suppression loss to enforce domain-invariant, causally relevant features. The method integrates a CLIP-based video-text matching objective with these causal components in a unified loss, achieving state-of-the-art results on the SurgVisDom hard adaptation benchmark, notably outperforming SDA-CLIP and other baselines. The findings demonstrate that focusing on stable causal factors and controlling style-related cues improves cross-domain robustness, with potential to enhance context-aware decision support in operating rooms.
Abstract
Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
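As an illustration of the frequency-based augmentation idea, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the exact perturbation, so this sketch assumes a common Fourier-domain recipe: the amplitude spectrum is treated as the domain-specific (style-like) part and is blended with that of another frame, while the phase spectrum, which is assumed to carry the semantic structure, is kept unchanged. The function name `amplitude_mix`, the blending weight `lam`, and the random toy frames are hypothetical and chosen only for this example.

```python
import numpy as np


def amplitude_mix(content, style, lam=0.5):
    """Blend the Fourier amplitude of `content` with that of `style`,
    keeping the phase of `content`.

    Both inputs are real float arrays of identical shape (H, W, C).
    The output is not clipped, so values may leave the input range.
    """
    if content.shape != style.shape:
        raise ValueError("content and style must have the same shape")
    # 2-D FFT over the spatial axes only; channels are handled independently.
    f_content = np.fft.fft2(content, axes=(0, 1))
    f_style = np.fft.fft2(style, axes=(0, 1))
    # Assumption of this sketch: amplitude ~ style, phase ~ structure.
    amp = (1.0 - lam) * np.abs(f_content) + lam * np.abs(f_style)
    phase = np.angle(f_content)
    mixed = amp * np.exp(1j * phase)
    # The mixed spectrum stays conjugate-symmetric, so the imaginary
    # part of the inverse transform is only floating-point round-off.
    return np.fft.ifft2(mixed, axes=(0, 1)).real


# Toy usage with random arrays standing in for two video frames.
rng = np.random.default_rng(0)
frame_a = rng.random((8, 8, 3))
frame_b = rng.random((8, 8, 3))
augmented = amplitude_mix(frame_a, frame_b, lam=0.3)
```

With `lam=0` the function returns the original frame, and larger values move the amplitude spectrum further toward the second frame. In a training pipeline, such an augmented view could be paired with the original frame so that a loss term penalizes feature differences between the two, which is one plausible way to connect the augmentation to the causal suppression objective described above.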
