CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

Yuxin He, An Li, Cheng Xue

TL;DR

The paper tackles domain generalization for surgical phase recognition under sim-to-real shifts, where VR-like training data differ from real clinical scenes. It proposes CauCLIP, a causality-inspired vision-language framework built on CLIP that uses frequency-domain augmentation and a causal suppression loss to enforce domain-invariant, causally relevant features. The method integrates a CLIP-based video-text matching objective with these causal components in a unified loss, achieving state-of-the-art results on the SurgVisDom hard adaptation benchmark, notably outperforming SDA-CLIP and other baselines. The findings demonstrate that focusing on stable causal factors and controlling style-related cues improves cross-domain robustness, with potential to enhance context-aware decision support in operating rooms.

Abstract

Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
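The frequency-based augmentation described above can be illustrated with a common recipe: perturb the Fourier amplitude, which tends to carry style cues such as lighting and texture, and keep the phase, which tends to carry structure. The abstract does not give the exact procedure, so the following is a minimal sketch under that assumption rather than the authors' implementation. It works on a 1D signal in pure Python to stay dependency-free; a real pipeline would more likely apply a 2D FFT per frame. The names `dft`, `idft`, `perturb_amplitude`, and `strength` are hypothetical.

```python
import cmath
import math
import random


def dft(signal):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft; returns a list of complex values."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def perturb_amplitude(signal, strength=0.3, rng=None):
    """Randomly rescale each frequency's amplitude while keeping its phase.

    Assumption for illustration: `strength` bounds the relative amplitude
    change, so every gain lies in [1 - strength, 1 + strength].
    """
    if not 0.0 <= strength < 1.0:
        raise ValueError("strength must be in [0, 1)")
    rng = rng or random.Random()
    n = len(signal)
    spectrum = dft(signal)
    # One positive gain per frequency pair (k, n - k). Sharing the gain keeps
    # the spectrum conjugate-symmetric, so the output of a real input is real.
    gains = [1.0 + rng.uniform(-strength, strength) for _ in range(n // 2 + 1)]
    # Multiplying by a positive real number changes magnitude, not phase.
    perturbed = [spectrum[k] * gains[min(k, n - k)] for k in range(n)]
    return [value.real for value in idft(perturbed)]
```

Calling `perturb_amplitude(x, strength=0.3)` on a real-valued list `x` returns a list of the same length whose spectrum has the same phase as that of `x`; with `strength=0.0` the input is returned unchanged up to floating-point error.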

Paper Structure (11 sections, 7 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Illustration of the same task in different domains. Domain shift introduces non-causal factors (E) such as lighting and texture, while causal factors (C) remain task-relevant. Our causal suppression module reduces the influence of E and reinforces C.
  • Figure 2: The framework of CauCLIP. The model learns visual-text alignment using original and augmented videos. A causal suppression module ensures that representations are invariant to stylistic perturbations introduced in the frequency domain.
  • Figure 3: Effect of $\lambda_{aug}$ and $\lambda_{sup}$ on balanced accuracy.
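
The weights $\lambda_{aug}$ and $\lambda_{sup}$ in Figure 3, together with the TL;DR's mention of a unified loss, suggest a weighted-sum objective. This summary does not reproduce the paper's equations, so the form below is only one plausible reading. The term names are assumptions: $\mathcal{L}_{clip}$ stands for the video-text matching loss on original videos, $\mathcal{L}_{aug}$ for the same loss on frequency-augmented videos, and $\mathcal{L}_{sup}$ for the causal suppression loss.

$$
\mathcal{L}_{total} = \mathcal{L}_{clip} + \lambda_{aug}\,\mathcal{L}_{aug} + \lambda_{sup}\,\mathcal{L}_{sup}
$$

Under this reading, setting $\lambda_{aug} = \lambda_{sup} = 0$ would recover plain CLIP-style video-text training, and Figure 3 would show how balanced accuracy changes as each weight is varied.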