OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu, Kun Yuan, Yaling Shen, Feilong Tang, Xiaohao Xu, Lin Zhou, Wei Li, Ying Chen, Zhongxing Xu, Zelin Peng, Siyuan Yan, Vinkle Srivastav, Diping Song, Tianbin Li, Danli Shi, Jin Ye, Nicolas Padoy, Nassir Navab, Junjun He, Zongyuan Ge

TL;DR

This work tackles the challenge of ophthalmic surgical video-language understanding by constructing OphVL, a large-scale, hierarchically structured dataset with 375K clip-text pairs and 30K silent videos, and proposing OphCLIP, a hierarchical retrieval-augmented pretraining framework. OphCLIP learns fine-grained clip-level representations from narrated content and coarse-grained video-level representations from titles, while leveraging a silent-video knowledge pool via a dynamic memory bank and MIPS-based retrieval to enhance long-term understanding. Through extensive zero-shot and few-/full-shot experiments across 11 downstream datasets, OphCLIP achieves state-of-the-art performance in surgical phase and multi-instrument recognition, demonstrating strong generalization and practical potential for ophthalmic AI tools. The combination of hierarchical video-text correspondences and retrieval-augmented learning offers a scalable path toward robust, context-aware surgical VLP in real-world settings.

Abstract

Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs with tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, and medications, as well as more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. OphCLIP also introduces a retrieval-augmented pretraining framework that leverages underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP's robust generalization and superior performance.
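
The retrieval step mentioned above (MIPS over a pool of silent-video embeddings) can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and uses exhaustive search. The function names `mips_top_k` and `augment`, and the simple averaging used to fuse retrieved embeddings, are hypothetical choices for illustration and are not taken from the paper.

```python
import heapq


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mips_top_k(query, memory_bank, k):
    """Exhaustive maximum inner product search.

    Returns (index, score) pairs for the k memory-bank entries whose
    inner product with the query is largest, best match first.
    """
    scored = ((inner_product(query, emb), idx) for idx, emb in enumerate(memory_bank))
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


def augment(query, memory_bank, k):
    """Fuse the query with its top-k retrieved embeddings.

    Assumption: plain averaging stands in for whatever fusion the
    actual framework uses; it is only meant to show the data flow.
    """
    hits = mips_top_k(query, memory_bank, k)
    retrieved = [memory_bank[idx] for idx, _ in hits]
    count = 1 + len(retrieved)
    return [
        (query[d] + sum(r[d] for r in retrieved)) / count
        for d in range(len(query))
    ]


if __name__ == "__main__":
    bank = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
    video_query = [1.0, 0.2]
    print(mips_top_k(video_query, bank, k=2))
    print(augment(video_query, bank, k=2))
```

Exhaustive search is linear in the size of the pool; at the scale of tens of thousands of silent videos an approximate nearest-neighbor index would be a natural substitute, though the source does not say which is used.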

Paper Structure

This paper contains 23 sections, 8 equations, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Dataset comparison and results comparison. Top: comparison of our OphVL with existing fully-supervised learning (FSL), VLP, and Q&A datasets. Bottom: accuracy comparison of CLIP, CLIP* (CLIP fine-tuned on OphVL dataset), and OphCLIP (ours) on phase recognition datasets.
  • Figure 2: Overview of OphVL construction pipeline. The curation pipeline starts with collecting real-world ophthalmic surgery videos and channels using over 3K expert-identified keywords. Next, we filter videos based on their “narrative style” to ensure rich explanatory content. For text extraction, we use ASR models to transcribe audio, followed by denoising and quality control using NLTK and SurgicBERTa to refine and correct medical terminology. Post-processing is done using LLMs to extract structured surgical descriptions.
  • Figure 3: OphCLIP's framework for video-language pretraining. OphCLIP performs vision-language pretraining at both clip and video levels, learning short-term visual representations from narrations and long-term representations from titles, enhanced by a knowledge base. OphCLIP has three components: narrative videos with associated narrative texts are processed through visual and text encoders, creating clip-level multi-modal embeddings; multi-modal embeddings of silent videos are stored in a dynamically updated memory bank, which forms the knowledge base; and video-level pretraining uses maximum inner product search to retrieve the top-K relevant silent-video embeddings for each query.
  • Figure 4: Attention map visualizations comparing CLIP, CLIP* (CLIP fine-tuned on OphVL), and OphCLIP (ours) for phase recognition (left) and instrument recognition (right) examples from the unseen Cataract-1K dataset. Left: For phase recognition (e.g., "phacoemulsification"), OphCLIP focuses on instruments and anatomy to identify the high-level surgical phase concept. Right: For instrument recognition, pretraining on OphVL enables CLIP* and OphCLIP to attend consistently to domain-specific tools like the lens injector.
  • Figure 5: Some examples of clip-text pairs from OphVL.
  • ...and 1 more figure
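
The dynamically updated memory bank described in the Figure 3 caption can be pictured with a small sketch. The source does not state the update rule, so the fixed capacity, the refresh-on-update behavior, and the oldest-first eviction below are all assumptions, and the class name `MemoryBank` and its methods are hypothetical.

```python
from collections import OrderedDict


class MemoryBank:
    """Bounded store of silent-video embeddings keyed by video id.

    Assumption: when the bank is full, the least recently updated
    entry is evicted. The actual update policy may differ.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._store = OrderedDict()

    def update(self, video_id, embedding):
        """Insert or refresh the embedding for a video."""
        # Remove any stale copy so the refreshed entry counts as newest.
        self._store.pop(video_id, None)
        self._store[video_id] = list(embedding)
        # Evict the oldest entries until the bank fits its capacity.
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)

    def ids(self):
        """Video ids ordered from oldest to newest update."""
        return list(self._store)

    def embeddings(self):
        """Embeddings in the same order as ids()."""
        return list(self._store.values())


if __name__ == "__main__":
    bank = MemoryBank(capacity=2)
    bank.update("video_a", [0.1, 0.9])
    bank.update("video_b", [0.8, 0.2])
    bank.update("video_a", [0.2, 0.8])  # refreshed by a newer encoder state
    bank.update("video_c", [0.5, 0.5])  # evicts video_b, the oldest entry
    print(bank.ids())
```

Refreshing entries as training proceeds keeps stored embeddings close to what the current encoders would produce, which is the usual motivation for updating such a bank rather than computing it once.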