
HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition

Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

TL;DR

HecVL tackles the challenge of generalizing surgical phase recognition with hierarchical video-language pretraining. It leverages three levels of textual supervision (clip-level narrations $N_i$, phase-level concepts $C_i$, and video-level abstracts $A_i$) to learn separate embedding spaces $S_{narration}$, $S_{concept}$, and $S_{abstract}$ via a fine-to-coarse contrastive framework. A single pair of encoders $F_v$ and $F_t$ is trained with three losses, one per level: $L_{clip}$ for clip-level narrations, $L_{phase}$ for phase-level concepts, and $L_{video}$ for video-level abstracts, each an InfoNCE objective with temperature $\tau$. The approach is evaluated on zero-shot surgical phase recognition across multiple procedures (e.g., Cholec80, AutoLaparo) and across medical centers (StrasBypass70, BernBypass70), demonstrating state-of-the-art cross-dataset and cross-center transfer without manual annotations. The results indicate that keeping a distinct embedding space per hierarchy level, rather than a single shared space, yields more robust and transferable representations, which highlights the value of hierarchical textual supervision for generalist surgical understanding. Overall, HecVL advances zero-shot generalization in surgical video analysis, enabling broader applicability across procedures and centers with minimal annotation effort.

Abstract

Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model's transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing the surgical lecture video with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers. The code is available at https://github.com/CAMMA-public/SurgVLP
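
To make the zero-shot setting described above concrete, the following is a minimal sketch of how a phase could be predicted once video and text share an embedding space: each frame embedding is compared with the text embedding of every phase description, and the most similar phase is chosen. The embeddings, the phase names, and the helper names `cosine` and `zero_shot_phase` are illustrative placeholders, not outputs or functions of the released SurgVLP code, and cosine similarity is an assumed scoring rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_phase(frame_embedding, phase_text_embeddings):
    """Return the phase whose text embedding is most similar to the frame."""
    return max(
        phase_text_embeddings,
        key=lambda name: cosine(frame_embedding, phase_text_embeddings[name]),
    )


# Hand-made toy embeddings (assumption): a real system would obtain these
# from the pretrained visual and text encoders.
phase_text_embeddings = {
    "preparation": [1.0, 0.0, 0.0],
    "dissection": [0.0, 1.0, 0.0],
    "clipping and cutting": [0.0, 0.0, 1.0],
}
frame_embedding = [0.1, 0.9, 0.2]

predicted = zero_shot_phase(frame_embedding, phase_text_embeddings)
print(predicted)  # dissection
```

No phase labels are used to fit anything here; the only supervision enters through the text descriptions, which is what allows the same model to be queried with the phase vocabulary of a different procedure or center.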

Paper Structure (16 sections, 2 equations, 2 figures, 3 tables)


Figures (2)

  • Figure 1: Hierarchical video-text pairs in surgical lecture videos. Conventional methods [yuan2023learning] utilize only clip-level video-text pairs, while HecVL utilizes pairs at several hierarchical levels to perform video-language pretraining.
  • Figure 2: Pipeline of the HecVL approach. (a) Conventional video-language methods embed video clips and texts of different granularities into the same embedding space. (b) The HecVL approach accounts for the granularity differences and constructs three embedding spaces for clip-, phase-, and video-level representation learning. (c) The fine-grained embedding space ($S_{narration}$) is learned first, followed by the coarse-grained embedding spaces ($S_{concept}$ and $S_{abstract}$), which are learned by applying a temporal aggregation function to the visual and textual embeddings.
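
The fine-to-coarse idea in Figure 2(c) can be sketched as follows: clip embeddings are pooled over time into one phase-level embedding, which is then aligned with the phase-level text through an InfoNCE loss. This is a simplified illustration under stated assumptions, not the authors' implementation: mean pooling stands in for the temporal aggregation function, the loss is written in a symmetric video-to-text and text-to-video form, `tau=0.07` is an arbitrary default, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def mean_pool(vectors):
    """Assumed temporal aggregation: average the embeddings over time."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def info_nce(video_embs, text_embs, tau=0.07):
    """Symmetric InfoNCE over a batch where video i is paired with text i."""
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    n = len(v)
    # Temperature-scaled similarity of every video with every text.
    logits = [[dot(v[i], t[j]) / tau for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean negative log-softmax of the matching (diagonal) entry per row.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    video_to_text = cross_entropy_diag(logits)
    text_to_video = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (video_to_text + text_to_video)


# Toy data (assumption): two phases, each covered by two clip embeddings.
clips_phase_a = [[1.0, 0.1], [0.9, 0.0]]
clips_phase_b = [[0.0, 1.0], [0.1, 0.8]]
phase_video = [mean_pool(clips_phase_a), mean_pool(clips_phase_b)]
phase_text = [[1.0, 0.0], [0.0, 1.0]]

aligned = info_nce(phase_video, phase_text)
shuffled = info_nce(phase_video, phase_text[::-1])
print(aligned < shuffled)  # True
```

The loss is lower when each pooled phase embedding is paired with its own text than when the texts are swapped, which is the signal the contrastive objective exploits. The same pattern would apply one level up, pooling phase-level embeddings into a video-level embedding before aligning it with the abstract.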