Zero-shot Prompt-based Video Encoder for Surgical Gesture Recognition

Mingxing Rao, Yinhong Qin, Soheil Kolouri, Jie Ying Wu, Daniel Moyer

TL;DR

This work addresses the need for scalable surgical gesture recognition without exhaustively annotating every possible gesture. It adapts the Bridge-Prompt framework to prompt-tune a CLIP-based video encoder with weak, text-augmented supervision, producing strong within-task and zero-shot gesture representations that are evaluated on JIGSAWS and RARP-45. The method combines multi-prompt text signals and a fusion module to generate gesture embeddings, optimized by three contrastive losses that align video features with semantic, integrated, and statistical text prompts. The findings show that the prompt-tuned encoder delivers competitive or superior gesture recognition performance and notable zero-shot generalization, suggesting a practical path toward flexible, data-efficient surgical assistance systems. The work also provides open-source code for prompt-tuning encoders, which supports applying the approach to diverse procedures with minimal annotation.

Abstract

Purpose: To produce a surgical gesture recognition system that can support a wide variety of procedures, either a very large annotated dataset must be acquired, or fitted models must generalize to new labels (so-called "zero-shot" capability). In this paper we investigate the feasibility of the latter option. Methods: Leveraging the Bridge-Prompt framework, we prompt-tune a pre-trained vision-text model (CLIP) for gesture recognition in surgical videos. This approach can utilize extensive outside data such as text, and can also make use of label metadata and weakly supervised contrastive losses. Results: Our experiments show that the prompt-based video encoder outperforms standard encoders in surgical gesture recognition tasks. Notably, it displays strong performance in zero-shot scenarios, where gestures/tasks that were not provided during the encoder training phase are included in the prediction phase. Additionally, we measure the benefit of including text descriptions in the feature extractor training schema. Conclusion: Bridge-Prompt and similar pre-trained, prompt-tuned video encoder models provide strong visual representations for surgical robotics, especially in gesture recognition tasks. Given the diverse range of surgical tasks (gestures), the ability of these models to transfer zero-shot, without any task-specific (gesture-specific) retraining, makes them invaluable.
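
To make the idea of a contrastive loss between video and text features more concrete, the following is a minimal sketch of a generic CLIP-style symmetric contrastive objective. It is not the paper's implementation and does not reproduce the three Bridge-Prompt losses; the function names, the NumPy formulation, and the temperature value of 0.07 are all illustrative assumptions. Row i of the video embeddings is assumed to be paired with row i of the text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss (hypothetical helper, not the authors' code).

    video_emb, text_emb: arrays of shape (batch, dim); row i of each is a
    matched video/text pair. The temperature is an illustrative assumption.
    """
    v = l2_normalize(np.asarray(video_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matched pairs lie on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)
```

Under these assumptions, the loss is small when each video embedding is closest to its own text embedding and grows when the pairing is shuffled, which is the property a prompt-tuned encoder would be trained to exploit.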

Paper Structure (14 sections, 4 equations, 4 figures, 8 tables)

This paper contains 14 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The two phases of our training schema: at top the Bridge-Prompt pre-training, at bottom a "simple probe" predictor measuring performance on the supervised gesture recognition task.
  • Figure 2: JIGSAWS Leave-One-User-Out boxplots for main-text Table 1.
  • Figure 3: JIGSAWS Leave-One-User-Out boxplots for main-text Table 2.
  • Figure 4: JIGSAWS Leave-One-User-Out boxplots for main-text Table 3.
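
The Leave-One-User-Out protocol named in the captions above holds out all trials from one user per fold and trains on the rest. The sketch below shows one way such folds could be generated; the function name and the user and trial identifiers are hypothetical and are not taken from the paper or from the JIGSAWS release.

```python
from collections import defaultdict


def leave_one_user_out(trials):
    """Yield (held_out_user, train_trials, test_trials) for each user.

    trials: iterable of (user_id, trial_id) pairs. Identifiers here are
    placeholders chosen for illustration only.
    """
    by_user = defaultdict(list)
    for user, trial in trials:
        by_user[user].append(trial)

    users = sorted(by_user)
    for held_out in users:
        # Every trial from the held-out user goes to the test split.
        test = list(by_user[held_out])
        # All remaining users' trials form the training split.
        train = [t for u in users if u != held_out for t in by_user[u]]
        yield held_out, train, test
```

A per-fold metric computed this way gives one value per held-out user, which is the kind of distribution a boxplot would summarize.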