CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

Abstract

Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state of the art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.
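
At the core of these objectives is contrastive alignment between paired clip and caption embeddings. The PyTorch-style sketch below illustrates a symmetric video-text contrastive (InfoNCE) loss of the kind referred to as VTC; it is only an illustration, not the authors' implementation, and the embedding dimensionality, temperature value, and plain InfoNCE form are assumptions. In the contextual variant (VTC_CTX), the same loss would be computed on context-aware embeddings rather than on the dual-encoder outputs.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D) pooled embeddings of B paired clips and captions;
    # matching pairs lie on the diagonal of the similarity matrix.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                     # (B, B) scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)        # video -> text direction
    loss_t2v = F.cross_entropy(logits.T, targets)      # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random embeddings (shapes are illustrative only).
video_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
print(video_text_contrastive_loss(video_emb, text_emb).item())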

Figures (4)

  • Figure 1: Overall CliPPER Framework: CliPPER is a powerful pretraining framework for surgical video–language understanding, designed to capture long-form and contextual relationships. It leverages a suite of novel objectives: context-aware contrastive learning ($\mathcal{L}_{VTC_{CTX}}$) and a Cycle-Consistency Alignment loss ($\mathcal{L}_{cycle}$), Frame-Text Matching ($\mathcal{L}_{FTM}$) for pinpointing relevant frames in long procedures, and Clip Order Prediction ($\mathcal{L}_{COP}$) for temporal reasoning. CliPPER demonstrates strong performance across a wide range of complex surgical downstream tasks, from high-level phase recognition to fine-grained activity triplet identification, without requiring any additional training.
  • Figure 2: CliPPER overview: CliPPER processes multiple clips from the same video independently through modality-specific encoders, with a Video Encoder for the visual input and a Text Encoder for the corresponding captions. The resulting pooled frame-level and text CLS embeddings are then passed through context encoders, which operate across clips from the same video to generate context-aware embeddings for each modality independently. On the Dual Encoder representations, we apply the standard video-text contrastive loss ($\mathcal{L}_{VTC}$). The context-aware representations are optimized through $\mathbf{VTC_{CTX}}$, a context-aware contrastive objective, together with a Cycle-Consistency Alignment loss that enforces bidirectional alignment consistency. In parallel, the model fuses all frame and text embeddings across clips using a Multi-Modal Encoder, enabling an additional objective, $\mathbf{FTM}$, which guides the model toward fine-grained alignment by learning to localize which frames from multiple clips of the same video match a given text. Lastly, we also fuse the visual and textual contextual embeddings in a separate step by applying a single Cross-Attention layer; on top of these fused embeddings, we predict the temporal order of the elements (COP), enabling the model to reason about sequence information (a rough sketch of this ordering step follows the figure list).
  • Figure 3: Zero-shot Phase Recognition at varying temporal windows: The final plot shows the average performance across all datasets. The highlighted values indicate the improvement over the strongest baseline for both Ours-SVL and Ours (SVL + YouTube).
  • Figure 4: Zero-shot Step Recognition at varying temporal windows: The final plot shows the average performance across all datasets. The highlighted values indicate the improvement over the strongest baseline for both Ours-SVL and Ours (SVL + YouTube).
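
The sketch below is a rough, hypothetical rendering of the Clip Order Prediction step described in the Figure 2 caption: fuse the visual and textual contextual embeddings with a single cross-attention layer, then predict temporal order on top of the fused elements. Treating order prediction as per-element position classification over shuffled clips is an assumption, as are the layer sizes and the use of nn.MultiheadAttention for the fusion step; only the cross-attention fusion followed by an order-prediction objective is taken from the caption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipOrderPrediction(nn.Module):
    def __init__(self, dim=256, num_clips=4):
        super().__init__()
        # Single cross-attention layer fusing visual and textual contextual embeddings.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # For each fused element, predict its true temporal position among the clips.
        self.order_head = nn.Linear(dim, num_clips)

    def forward(self, vis_ctx, txt_ctx):
        # vis_ctx, txt_ctx: (B, N, D) context-aware embeddings of N clips / captions.
        fused, _ = self.cross_attn(query=vis_ctx, key=txt_ctx, value=txt_ctx)
        return self.order_head(fused)                  # (B, N, num_clips) position logits

B, N, D = 2, 4, 256
model = ClipOrderPrediction(dim=D, num_clips=N)
vis_ctx, txt_ctx = torch.randn(B, N, D), torch.randn(B, N, D)

# Feed the clips in a shuffled order; the target for each shuffled slot is the
# clip's true temporal index, supervised with a standard cross-entropy loss.
perm = torch.stack([torch.randperm(N) for _ in range(B)])          # (B, N) true indices
idx = perm.unsqueeze(-1).expand(B, N, D)
logits = model(vis_ctx.gather(1, idx), txt_ctx.gather(1, idx))
loss = F.cross_entropy(logits.reshape(B * N, N), perm.reshape(B * N))
print(loss.item())

A permutation-classification or pairwise-ordering head would be an equally plausible reading of the caption; the per-position formulation above is chosen only for simplicity.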