Table of Contents
Fetching ...

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

Juseong Jin, Chang Wook Jeong

TL;DR

A LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios, which exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions.

Abstract

Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation of visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

TL;DR

A LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios, which exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions.

Abstract

Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation of visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An example to illustrate the instruction-following data. We utilized the original caption to create an annotation that follows instructions with various prompts. The video and caption were acquired from Cholec80 dataset hong2020cholecseg8k. The instruction-following data generated by GPT-3.5 using the text only (captions).
  • Figure 2: Architecture of Surgical-LLaVA. We adopted llava as the baseline, which vicuna as the LLM model and the pre-trained CLIP visual encoder ViT-L/14 as the visual model. The training involves encoding these inputs into token representations, followed by joint contrastive learning to align modalities within the semantic space.
  • Figure 3: Example comparison of surgical visual chat and reasoning capabilities. Compared to Video-LLaVA lin2023video, Surgical-LLaVA offers specific and accurate answers to surgical scenarios.
  • Figure 4: Examples from Surgical-LLaVA's demonstration of video reasoning. It shows conversation, detail description and complex reasoning cases.