SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning

Alejandra Perez, Chinedu Nwoye, Ramtin Raji Kermani, Omid Mohareri, Muhammad Abdullah Jamal

TL;DR

SurgLaVi introduces the largest hierarchically structured surgical vision-language dataset to date and demonstrates that high-quality, large-scale, multi-level annotations markedly improve downstream recognition and retrieval across phase, step, action, and tool tasks. The authors build SurgCLIP, a lightweight CLIP-style model trained on SurgLaVi and SurgLaVi-β, which leverages multi-level clip-caption data with dynamic temporal sampling and a dual-encoder architecture to achieve strong zero-shot and few-shot transfer to diverse surgical benchmarks. A four-stage, fully automated data processing pipeline generates temporally precise and semantically rich clip-caption pairs at coarse, mid, and fine granularity, supplemented by contextual caption enrichment. The results show that dataset scale, diversity, and hierarchical structure can yield better results than more complex model architectures, enabling robust surgical foundation models with reduced training cost and broad applicability to workflow understanding and multimodal reasoning in surgical AI.

Abstract

Vision-language pre-training (VLP) offers unique advantages for surgery by aligning language with surgical videos, enabling workflow understanding and transfer across tasks without relying on expert-labeled datasets. However, progress in surgical VLP remains constrained by the limited scale, procedural diversity, semantic quality, and hierarchical structure of existing datasets. In this work, we present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures and organized into coarse, mid, and fine hierarchical levels. At the core of SurgLaVi lies a fully automated pipeline that systematically generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure high-quality annotations, it applies dual-modality filtering to remove irrelevant and noisy samples. Within this framework, the resulting captions are enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, we release SurgLaVi-$\beta$, an open-source derivative of 113k clip-caption pairs constructed entirely from public data, which is over four times larger than existing surgical VLP datasets. To demonstrate the value of the SurgLaVi datasets, we introduce SurgCLIP, a CLIP-style video-text contrastive framework with dual encoders, as a representative base model. SurgCLIP achieves consistent improvements across phase, step, action, and tool recognition, surpassing prior state-of-the-art methods, often by large margins. These results validate that large-scale, semantically rich, and hierarchically structured datasets directly translate into stronger and more generalizable representations, establishing SurgLaVi as a key resource for developing surgical foundation models.
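
As a rough illustration of what a "CLIP-style video-text contrastive framework with dual encoders" optimizes, the sketch below computes a symmetric contrastive loss over a batch of already-encoded clip and caption embeddings. It is a minimal sketch under stated assumptions, not SurgCLIP's actual objective: the function names, the toy embeddings, and the temperature of 0.07 (the initial value used by the original CLIP) are all illustrative choices that do not come from this page.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style loss: (video_i, text_i) is the positive pair,
    every other pairing in the batch acts as a negative.
    The temperature value is an assumption for illustration."""
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: softmax over each row.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: softmax over each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy batch of two clips and two captions (hypothetical 2-D embeddings).
clips = [[1.0, 0.0], [0.0, 1.0]]
captions_aligned = [[1.0, 0.0], [0.0, 1.0]]
captions_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(clips, captions_aligned)
loss_shuffled = symmetric_contrastive_loss(clips, captions_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each clip is closest to its own caption and large when the pairing is wrong, which is the property that lets a trained dual encoder be used for zero-shot recognition by comparing a clip against text prompts.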

Paper Structure

This paper contains 29 sections, 11 equations, 30 figures, 13 tables.

Figures (30)

  • Figure 1: Overview of SurgLaVi. (A) Hierarchical clip–caption pairs at coarse, mid, and fine levels provide multi-scale temporal granularity for language-supervised analysis. (B) Specialty and subject video distribution, demonstrating broad coverage across diverse surgical domains. (C) Zero-shot phase recognition F1 performance across five datasets, where SurgCLIP trained on SurgLaVi and SurgLaVi-$\beta$ surpasses prior state-of-the-art approaches. (D) Dataset scale comparison, showing that SurgLaVi and SurgLaVi-$\beta$ substantially exceed existing surgical VLP datasets.
  • Figure 1: Textual prompts for zero-shot classification into surgical (green) and non-surgical (red) classes using SigLIP. Class embeddings are obtained by averaging multiple prompts.
  • Figure 2: SurgLaVi Data Processing Pipeline Overview. Stage 1: Speech-to-text conversion with fine-grained timestamps. Stage 2: Hierarchical semantic transcript segmentation followed by video–text alignment to generate semantic clip-caption pairs at coarse, mid, and fine levels of granularity. Stage 3: Dual-modality filtering of pairs for surgical visual relevance and textual descriptiveness. Stage 4: Contextual caption enrichment using prior context and metadata to enhance clip-caption semantic alignment.
  • Figure 2: Filtering impact on SurgLaVi-$\beta$ and SurgLaVi. Clip–caption pairs are progressively reduced through surgical filtering, descriptiveness filtering, and the intersection of both.
  • Figure 3: Hierarchical clip-caption pair structure. Example illustrating the three hierarchical levels of granularity in the dataset: (1) Coarse-level pairs have long procedural coverage with complex context-rich descriptions, but lower temporal granularity; (2) Mid-level pairs provide intermediate temporal granularity with moderate complexity captions; (3) Fine-level pairs provide short-duration segments with high temporal resolution and action-focused descriptions.
  • ...and 25 more figures
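
The captions above mention two mechanisms that are easy to sketch: class embeddings obtained by averaging several text prompts for zero-shot surgical versus non-surgical classification, and a dual-modality filter that keeps only pairs passing both the visual and the textual check. The following is a minimal sketch under stated assumptions. The 2-D vectors stand in for real SigLIP embeddings, the comparison of cosine similarities simplifies how such a model would score classes, and the word-count rule in `is_descriptive` is a hypothetical placeholder for the paper's descriptiveness filter, whose actual criterion is not given here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def class_embedding(prompt_embeddings):
    """Average the normalized embeddings of several prompts for one class,
    then renormalize (prompt ensembling)."""
    normed = [l2_normalize(p) for p in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(p[d] for p in normed) / len(normed) for d in range(dim)]
    return l2_normalize(mean)


def is_surgical(frame_emb, surgical_cls, non_surgical_cls):
    """Zero-shot decision: pick the class whose embedding is more similar."""
    f = l2_normalize(frame_emb)
    return dot(f, surgical_cls) > dot(f, non_surgical_cls)


def is_descriptive(caption, min_words=5):
    """Hypothetical stand-in for a textual descriptiveness check."""
    return len(caption.split()) >= min_words


def dual_modality_filter(pairs, surgical_cls, non_surgical_cls):
    """Keep only pairs that pass BOTH filters (their intersection)."""
    return [
        p for p in pairs
        if is_surgical(p["frame_emb"], surgical_cls, non_surgical_cls)
        and is_descriptive(p["caption"])
    ]


# Toy prompt embeddings (made up; a real system would encode text prompts).
surgical_cls = class_embedding([[1.0, 0.1], [0.9, 0.2]])
non_surgical_cls = class_embedding([[0.1, 1.0], [0.2, 0.9]])

pairs = [
    {"id": "a", "frame_emb": [0.8, 0.1],
     "caption": "The surgeon dissects the cystic duct with a hook."},
    {"id": "b", "frame_emb": [0.1, 0.9],
     "caption": "Here the presenter shows a slide about patient outcomes."},
    {"id": "c", "frame_emb": [0.9, 0.2], "caption": "Okay."},
]

kept = dual_modality_filter(pairs, surgical_cls, non_surgical_cls)
kept_ids = [p["id"] for p in kept]
print(kept_ids)
```

Pair "b" fails the visual check and pair "c" fails the textual check, so only pair "a" survives, mirroring the progressive reduction described in the filtering-impact figure.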