LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

Jiajie Li, Garrett Skinner, Gene Yang, Brian R Quaranto, Steven D Schwaitzberg, Peter C W Kim, Jinjun Xiong

TL;DR

This work addresses the lack of domain-specific multimodal dialogue over surgical videos by introducing Surg-QA, a large-scale surgical video instruction-tuning dataset (~102K QA pairs from ~2,151 lecture videos) generated with a two-stage QA pipeline that mitigates LLM hallucination. Building on Surg-QA, the authors train LLaVA-Surg, a video-language model tailored to surgical content. It uses a Video-ChatGPT–style architecture with a frozen CLIP encoder and a fine-tuned LLaVA-Med backbone, and achieves state-of-the-art zero-shot performance on surgical video QA. The approach combines structured knowledge extraction (observations, reasoning, plans, deductions) with visual concept alignment to support multi-turn conversations about surgical videos. Quantitative and qualitative evaluations, including GPT-based scoring that correlates strongly with human expert ratings (Spearman's ρ = 0.94), show better performance than general-domain and surgical-domain baselines. The authors commit to an open-source release to accelerate research in surgical AI.

Abstract

Multimodal large language models (LLMs) have achieved notable success across various domains, while research in the medical field has largely focused on unimodal images. Meanwhile, current general-domain multimodal models for videos still lack the capabilities to understand and engage in conversations about surgical videos. One major contributing factor is the absence of datasets in the surgical field. In this paper, we create a new dataset, Surg-QA, consisting of 102,000 surgical video-instruction pairs, the largest of its kind so far. To build such a dataset, we propose a novel two-stage question-answer generation pipeline that uses an LLM to learn surgical knowledge in a structured manner from publicly available surgical lecture videos. The pipeline breaks the generation process into two stages to significantly reduce the task complexity, allowing us to use a more affordable, locally deployed open-source LLM instead of premium paid LLM services. It also mitigates the risk of LLM hallucinations during question-answer generation, thereby enhancing the overall quality of the generated data. We further train LLaVA-Surg, a novel vision-language conversational assistant capable of answering open-ended questions about surgical videos, on this Surg-QA dataset, and conduct comprehensive evaluations on zero-shot surgical video question-answering tasks. We show that LLaVA-Surg significantly outperforms all previous general-domain models, demonstrating exceptional multimodal conversational skills in answering open-ended questions about surgical videos. We will release our code, model, and the instruction-tuning dataset.
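
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the JSON response format, the function names (`extract_structured`, `generate_qa`, `two_stage_qa`), and the toy `fake_llm` stand-in are all assumptions made for illustration. The only point it tries to show is the separation of concerns: stage 1 extracts structured statements from the transcript, and stage 2 writes question-answer pairs from those statements alone, so the second stage never sees free text it could embellish.

```python
import json
from typing import Callable, Dict, List

# Category names follow the TL;DR (observations, reasoning, plans, deductions).
CATEGORIES = ("observation", "reasoning", "plan", "deduction")

LLM = Callable[[str], str]  # hypothetical interface: prompt in, JSON text out


def extract_structured(title: str, transcript: str, llm: LLM) -> Dict[str, List[str]]:
    """Stage 1: ask the LLM to pull categorized statements out of the transcript."""
    prompt = (
        "STAGE 1: Extract statements from the transcript as JSON with keys "
        f"{list(CATEGORIES)}.\nTitle: {title}\nTranscript: {transcript}"
    )
    raw = json.loads(llm(prompt))
    # Keep only the expected categories; missing ones become empty lists.
    return {c: [str(s) for s in raw.get(c, [])] for c in CATEGORIES}


def generate_qa(info: Dict[str, List[str]], llm: LLM) -> List[Dict[str, str]]:
    """Stage 2: turn each extracted statement into one question-answer pair."""
    pairs = []
    for category, statements in info.items():
        for statement in statements:
            prompt = (
                "STAGE 2: Write one question and answer as JSON with keys "
                "'question' and 'answer', using only this statement.\n"
                f"Category: {category}\nStatement: {statement}"
            )
            qa = json.loads(llm(prompt))
            pairs.append(
                {"category": category, "question": qa["question"], "answer": qa["answer"]}
            )
    return pairs


def two_stage_qa(title: str, transcript: str, llm: LLM) -> List[Dict[str, str]]:
    """Run stage 1, then feed only its output to stage 2."""
    return generate_qa(extract_structured(title, transcript, llm), llm)


def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real model, with made-up content, so the sketch runs offline."""
    if prompt.startswith("STAGE 1"):
        return json.dumps({"observation": ["The gallbladder is retracted."]})
    statement = prompt.rsplit("Statement: ", 1)[1]
    return json.dumps({"question": "What is shown in the video?", "answer": statement})


if __name__ == "__main__":
    for pair in two_stage_qa("Example lecture", "example narration text", fake_llm):
        print(pair)
```

In practice `fake_llm` would be replaced by a call to a locally deployed model; any such wrapper only needs to accept a prompt string and return JSON text for this sketch to work.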

Paper Structure (21 sections, 3 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Surgical Knowledge Pyramid. Surgical video interpretation can be categorized into four levels. The first two levels represent observation capabilities, which can be captured by traditional computer vision tasks such as object detection, segmentation, and labeling, but these convey only a superficial level of understanding. The next two levels represent reasoning capabilities. Interpretation at the reasoning levels provides the rationale behind the observations and further offers deductions and plans, conveying a deep, surgical expert-level understanding.
  • Figure 2: Instruction-Tuning Data Generation Pipeline. Top: Structured surgical video learning begins with untrimmed lecture videos divided into clips. Expert narrations from the lectures are transcribed to text using WhisperX (Bain et al.). We then prompt Llama-3-70B to extract structured information from the transcripts. Finally, the extracted information is provided to Llama-3-70B to generate the instruction-tuning data. Bottom: Surgical visual concept alignment data are concise descriptions of surgical videos, generated from surgical action triplets.
  • Figure 3: Comparison of instruction-tuning data generated by our two-stage approach (top) and the previous end-to-end approach (bottom). Both approaches were given the same video title and transcript. Our approach accurately extracted information from the transcript, generating correct question-answer pairs. In contrast, the conventional end-to-end approach produced incorrect question-answer pairs due to hallucination.
  • Figure 4: Data statistics of the surgical multimodal instruction-tuning data: (a,b) The root verb-noun pairs provide an overview of the instructions and responses in our dataset. In each plot, the inner circle represents the root verb and the outer circle represents its direct-object nouns. (c) The distribution of videos by type. (d) The distribution of videos and QA pairs across 11 categories.
  • Figure 5: Human Expert vs GPT-3.5-Turbo Evaluation. Spearman's rank correlation coefficient $\rho=0.94$.
  • ...and 4 more figures
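
Figure 5 reports agreement between human expert and GPT-3.5-Turbo evaluation as a Spearman's rank correlation coefficient. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: it ranks both score lists (ties share the average rank) and computes the Pearson correlation of the ranks. The function names and the example scores are hypothetical and are not taken from the paper's evaluation data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores for five answers, invented for this example.
    human_scores = [1, 2, 3, 4, 5]
    model_scores = [1, 3, 2, 5, 4]
    print(spearman_rho(human_scores, model_scores))
```

Because only the ordering of scores matters, a high value indicates that the two raters rank answers similarly, even if their absolute scores differ.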