fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models

Saurav Sharma, Didier Mutter, Nicolas Padoy

TL;DR

This paper tackles zero-shot fine-grained surgical action recognition by addressing CLIP’s reliance on global image features and its lack of hierarchy in triplet labels. It introduces fine-CLIP, which combines object-centric features extracted by Semantic Graph Condensation, hierarchical text prompts, and LoRA-based backbone adaptation, guided by a hierarchical margin loss. The approach is validated on the CholecT50 dataset with two challenging base-to-novel benchmarks, Unseen-Target and Unseen-Instrument-Verb, where it improves F1@3 and mAP over strong baselines. Overall, fine-CLIP advances zero-shot generalization to novel instrument-verb-target interactions, enabling more robust and adaptable surgical AI systems.

Abstract

While vision-language models like CLIP have advanced zero-shot surgical phase recognition, they struggle with fine-grained surgical activities, especially action triplets. This limitation arises because current CLIP formulations rely on global image features, which overlook the fine-grained semantics and contextual details crucial for complex tasks like zero-shot triplet recognition. Furthermore, these models do not explore the hierarchical structure inherent in triplets, reducing their ability to generalize to novel triplets. To address these challenges, we propose fine-CLIP, which learns object-centric features and leverages the hierarchy in triplet formulation. Our approach integrates three components: hierarchical prompt modeling to capture shared semantics, LoRA-based vision backbone adaptation for enhanced feature extraction, and a graph-based condensation strategy that groups similar patch features into meaningful object clusters. Since triplet classification is a challenging task, we introduce an alternative yet meaningful base-to-novel generalization benchmark with two settings on the CholecT50 dataset: Unseen-Target, assessing adaptability to triplets with novel anatomical structures, and Unseen-Instrument-Verb, where models need to generalize to novel instrument-verb interactions. fine-CLIP shows significant improvements in F1 and mAP, enhancing zero-shot recognition of novel surgical triplets.
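
To make the two base-to-novel settings concrete, here is a minimal sketch of how a triplet vocabulary could be partitioned into base and novel classes. The held-out sets, the function names, and the example triplets are illustrative assumptions; the actual splits used in the paper are not given in this summary.

```python
from typing import List, Set, Tuple

# A triplet is (instrument, verb, target); the names below are illustrative only.
Triplet = Tuple[str, str, str]


def split_unseen_target(
    triplets: List[Triplet], novel_targets: Set[str]
) -> Tuple[List[Triplet], List[Triplet]]:
    """Unseen-Target style split: a triplet is novel if its target is held out."""
    base = [t for t in triplets if t[2] not in novel_targets]
    novel = [t for t in triplets if t[2] in novel_targets]
    return base, novel


def split_unseen_instrument_verb(
    triplets: List[Triplet], novel_pairs: Set[Tuple[str, str]]
) -> Tuple[List[Triplet], List[Triplet]]:
    """Unseen-Instrument-Verb style split: novel if the (instrument, verb) pair is held out."""
    base = [t for t in triplets if (t[0], t[1]) not in novel_pairs]
    novel = [t for t in triplets if (t[0], t[1]) in novel_pairs]
    return base, novel


# Hypothetical toy vocabulary and held-out sets (assumptions, not the paper's split).
triplets: List[Triplet] = [
    ("grasper", "retract", "gallbladder"),
    ("grasper", "retract", "liver"),
    ("hook", "dissect", "gallbladder"),
    ("clipper", "clip", "cystic_duct"),
]
ut_base, ut_novel = split_unseen_target(triplets, {"liver"})
uiv_base, uiv_novel = split_unseen_instrument_verb(triplets, {("clipper", "clip")})
```

In both settings the model would be trained only on the base triplets and evaluated on the novel ones, so the two splits probe different kinds of generalization: to unseen anatomy in the first case and to unseen tool-action combinations in the second.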

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Model Overview: fine-CLIP tunes the vision backbone using LoRA [zanella2024low] and extracts object-centric features via Semantic Graph Condensation (SGC). Hierarchical prompts generate two-level text embeddings, while object features enhance image features through attention. A hierarchical margin loss on the combined object-aware and level-wise logits guides the final prediction.
  • Figure 2: Performance on novel triplets in the (a) Unseen-Target (UT) and (b) Unseen-Instrument-Verb (UIV) settings.
  • Figure 3: Qualitative Results: Visualization of the clusters. (Best viewed in color)
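
The clusters referred to in Figures 1 and 3 come from grouping similar patch features into object-level groups. The sketch below illustrates that general idea only: it links patches whose cosine similarity exceeds a threshold and mean-pools each connected component. It is not the paper's Semantic Graph Condensation; the function name `condense_patches`, the threshold value, and the toy features are assumptions made for illustration.

```python
import numpy as np


def condense_patches(patches: np.ndarray, threshold: float = 0.9):
    """Group patch features by thresholded cosine similarity (illustrative only).

    patches: array of shape (num_patches, dim) with non-zero rows.
    Returns (labels, centers): a cluster id per patch and one mean-pooled
    feature per cluster.
    """
    # Cosine similarity between all pairs of patches.
    unit = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = unit @ unit.T

    # Union-find over the graph whose edges are pairs with sim >= threshold.
    n = len(patches)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    # Relabel connected components as 0..k-1 in order of first appearance.
    roots = [find(i) for i in range(n)]
    remap = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    labels = np.array([remap[r] for r in roots])

    # One condensed feature per cluster: the mean of its member patches.
    centers = np.stack([patches[labels == k].mean(axis=0) for k in range(len(remap))])
    return labels, centers


# Toy features: two patches point along x, two along y.
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]])
labels, centers = condense_patches(toy, threshold=0.9)
```

On the toy input, the first two patches fall into one cluster and the last two into another. In a pipeline like the one in Figure 1, the condensed cluster features would be what the image features attend to, although the exact mechanism is described in the paper rather than here.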