An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

Lennart Maack, Alexander Schlaefer

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotation or error-prone generation with large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7,515 video clips densely annotated with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training set achieves the highest performance across all spatial-temporal tasks, validating the dataset's efficacy for improving the spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: The proposed SurgSTU-Pipeline for generating spatial-temporal understanding tasks leverages surgical videos with optionally available metadata, e.g., instrument-tissue interaction or instrument localization information. By using predefined spatial-temporal task templates, the SurgSTU-Pipeline creates diverse question-answer pairs for training and evaluating Vision-Language Models.
  • Figure 2: Overview of the proposed SurgSTU-Pipeline. The pipeline is composed of two primary steps: Metadata Extension & Structuring and Spatial-Temporal QA Generation. First, an active learning paradigm extends raw surgical videos with instrument localization information, which is then consolidated with existing metadata into a structured Event Tuple format. Subsequently, the Spatial-Temporal QA Generation module applies spatial and temporal continuity filtering to these tuples to mitigate annotation noise. Finally, the filtered sequences are processed through predefined task templates to deterministically generate a multimodal dataset comprising diverse, fine-grained spatial-temporal question-answer pairs (an illustrative sketch of this step follows below).
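
To make the pipeline in Figure 2 concrete, the following minimal sketch illustrates one way a structured Event Tuple format, continuity filtering, and deterministic template filling could fit together. The field names (frame, instrument, bbox, action, target), the gap and shift thresholds, and the question template are illustrative assumptions, not the paper's actual schema or implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical Event Tuple schema; the actual fields used by the
# SurgSTU-Pipeline are not specified here.
@dataclass
class EventTuple:
    frame: int                                # frame index within the clip
    instrument: str                           # e.g. "grasper"
    bbox: Tuple[float, float, float, float]   # normalized (x, y, w, h)
    action: str                               # e.g. "retract"
    target: str                               # e.g. "gallbladder"


def _center(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    x, y, w, h = bbox
    return (x + w / 2, y + h / 2)


def continuity_filter(events: List[EventTuple],
                      max_frame_gap: int = 5,
                      max_center_shift: float = 0.2) -> List[List[EventTuple]]:
    """Split an instrument's event stream into temporally and spatially
    continuous sequences; a large frame gap or box jump starts a new one."""
    sequences: List[List[EventTuple]] = []
    current: List[EventTuple] = []
    for ev in sorted(events, key=lambda e: e.frame):
        if current:
            prev = current[-1]
            (cx, cy), (px, py) = _center(ev.bbox), _center(prev.bbox)
            jump = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
            if ev.frame - prev.frame > max_frame_gap or jump > max_center_shift:
                sequences.append(current)
                current = []
        current.append(ev)
    if current:
        sequences.append(current)
    return sequences


def qa_from_template(seq: List[EventTuple]) -> dict:
    """Deterministically fill one spatial-temporal QA template from a filtered sequence."""
    first, last = seq[0], seq[-1]
    question = (f"Between frame {first.frame} and frame {last.frame}, which instrument "
                f"interacts with the {last.target}, and where is it located at the end?")
    answer = (f"The {last.instrument} performs '{last.action}' on the {last.target}; "
              f"its final bounding box is {tuple(round(v, 2) for v in last.bbox)}.")
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    events = [
        EventTuple(0, "grasper", (0.40, 0.30, 0.10, 0.10), "retract", "gallbladder"),
        EventTuple(3, "grasper", (0.42, 0.31, 0.10, 0.10), "retract", "gallbladder"),
        EventTuple(60, "grasper", (0.80, 0.75, 0.10, 0.10), "retract", "gallbladder"),
    ]
    for seq in continuity_filter(events):   # the third event starts a new sequence
        print(qa_from_template(seq))
```

Because the question-answer text is filled from structured annotations rather than produced by a language model, the output of such a step is reproducible and avoids hallucinated content, which is the motivation for the deterministic design described in the abstract.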