OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu; Kun Yuan; Yaling Shen; Feilong Tang; Xiaohao Xu; Lin Zhou; Wei Li; Ying Chen; Zhongxing Xu; Zelin Peng; Siyuan Yan; Vinkle Srivastav; Diping Song; Tianbin Li; Danli Shi; Jin Ye; Nicolas Padoy; Nassir Navab; Junjun He; Zongyuan Ge

TL;DR

This work tackles ophthalmic surgical video-language understanding with two contributions: OphVL, a large-scale, hierarchically structured dataset with 375K clip-text pairs and 30K silent videos, and OphCLIP, a hierarchical retrieval-augmented pretraining framework. OphCLIP learns fine-grained clip-level representations from narrated content and coarse-grained video-level representations from titles. It also draws on a knowledge pool of silent videos, using a dynamic memory bank and retrieval based on maximum inner product search (MIPS) to improve long-term understanding. In zero-shot, few-shot, and full-shot experiments across 11 downstream datasets, OphCLIP achieves state-of-the-art performance in surgical phase recognition and multi-instrument recognition, demonstrating strong generalization and practical potential for ophthalmic AI tools. Combining hierarchical video-text correspondences with retrieval-augmented learning offers a scalable path toward robust, context-aware surgical vision-language pretraining (VLP) in real-world settings.
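The memory bank and MIPS retrieval mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `MemoryBank` class, its first-in-first-out eviction policy, and the exact (brute-force) inner product search are assumptions chosen for clarity, and the summary does not say how OphCLIP updates or searches its bank.

```python
import numpy as np


class MemoryBank:
    """Hypothetical fixed-capacity FIFO store of silent-video embeddings."""

    def __init__(self, dim: int, capacity: int):
        self.bank = np.zeros((0, dim), dtype=np.float32)
        self.capacity = capacity

    def enqueue(self, feats: np.ndarray) -> None:
        # Append the newest features, then keep only the most recent
        # `capacity` rows so the oldest entries are evicted first.
        merged = np.concatenate([self.bank, feats.astype(np.float32)])
        self.bank = merged[-self.capacity:]

    def retrieve(self, query: np.ndarray, k: int):
        # Exact maximum inner product search: score every stored vector
        # against the query and return the k highest-scoring entries.
        scores = self.bank @ query
        k = min(k, len(scores))
        top = np.argsort(-scores)[:k]
        return top, scores[top]


bank = MemoryBank(dim=2, capacity=3)
bank.enqueue(np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))
indices, scores = bank.retrieve(np.array([1.0, 0.0]), k=1)
```

In this toy run the bank holds only the three most recent vectors, and the query retrieves the stored vector with the largest inner product. A large pool would normally replace the brute-force scoring with an approximate nearest-neighbor index.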

Abstract

Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge. This complexity, together with the limited availability of annotated data, makes surgical vision-language pretraining (VLP) particularly challenging. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework designed specifically for ophthalmic surgical workflow understanding. OphCLIP builds on OphVL, a large-scale dataset we constructed that contains over 375K hierarchically structured video-text pairs covering tens of thousands of attribute combinations: surgeries, phases/operations/actions, instruments, and medications, as well as more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations. These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations: aligning short video clips with detailed narrative descriptions captures intricate surgical details, while aligning full videos with structured titles captures high-level procedural insights. OphCLIP also introduces a retrieval-augmented pretraining scheme that exploits underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance representation learning on narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows that OphCLIP generalizes robustly and achieves superior performance.
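The two-level alignment described in the abstract can be sketched as a pair of CLIP-style contrastive objectives, one over clip/narration pairs and one over video/title pairs. The abstract does not give the loss, so everything here is an assumption for illustration: the symmetric InfoNCE form, the `temperature` value, the `weight` parameter, and the function names `clip_loss` and `hierarchical_loss`.

```python
import numpy as np


def _log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(vis: np.ndarray, txt: np.ndarray, temperature: float = 0.07) -> float:
    # L2-normalize so the dot product is a cosine similarity.
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = vis @ txt.T / temperature
    diag = np.arange(len(vis))
    # Row i of `vis` is assumed to be paired with row i of `txt`.
    v2t = -_log_softmax(logits, axis=1)[diag, diag].mean()
    t2v = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return float(0.5 * (v2t + t2v))


def hierarchical_loss(clip_v, narration_t, video_v, title_t, weight: float = 1.0) -> float:
    # Fine-grained term (clips vs. narrations) plus a weighted
    # coarse-grained term (full videos vs. titles).
    return clip_loss(clip_v, narration_t) + weight * clip_loss(video_v, title_t)


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
matched = clip_loss(emb, emb)
mismatched = clip_loss(emb, emb[[1, 0, 3, 2]])
```

Correctly paired embeddings give a lower loss than the same embeddings with their pairing shuffled, which is the behavior a contrastive alignment objective is meant to reward.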
