LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound

Xuechen Guo, Wenhao Chai, Shi-Yan Li, Gaoang Wang

TL;DR

This paper proposes a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning, and devises a fusion module with fine-grained vision encoders to enhance subtle medical visual semantics.

Abstract

Multimodal Large Language Models (MLLMs) have recently become a prominent research focus. By harnessing powerful LLMs, they move conversational generative AI from unimodal text to multimodal tasks. This boom has begun to significantly impact the medical field. However, general visual language models (VLMs) lack sophisticated comprehension for medical visual question answering (Med-VQA). Even models specifically tailored for the medical domain tend to produce vague answers with weak visual relevance. In this paper, we propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to enhance subtle medical visual semantics. We also note that the data redundancy common in medical scenes is ignored in most prior works. When a single text is paired with multiple figures, we use weighted scoring with knowledge distillation to adaptively screen for valid images that mirror the text descriptions. For execution, we leverage a large-scale multimodal Chinese ultrasound dataset obtained from the hospital. We create instruction-following data based on text from professional doctors, which ensures effective tuning. With the enhanced model and quality data, our Large Chinese Language and Vision Assistant for Ultrasound (LLaVA-Ultra) shows strong capability and robustness in medical scenarios. On three Med-VQA datasets, LLaVA-Ultra surpasses previous state-of-the-art models on various metrics.

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Data redundancy, common in medical scenes, creates a need for fine-grained perception and adaptation in MLLMs.
  • Figure 2: Overview of our proposed LLaVA-Ultra. Beyond the conventional MLLM architecture, it achieves visual enhancement via a fusion module that incorporates fine-grained SAM features. Additionally, our model can adapt to the data redundancy that commonly occurs in medical scenarios through two designed automatic sampling strategies.
  • Figure 3: An example of our instruction-following data generated by GPT-3.5 (ChatGPT). Top: A professional multimodal instance from our Chinese ultrasound hospital dataset. It exhibits data redundancy: one text corresponds to multiple images, but only those mirroring the textual descriptions are valid (e.g., those displaying lesions mentioned in the text). Bottom: The instruction-following data generated by GPT-3.5 using the textual descriptions.
  • Figure 4: Comparisons in medical visual conversations. LLaVA and LLaVA-Med tend to give vague answers irrelevant to images and wrong results. In contrast, LLaVA-Ultra offers more correct and specific responses associated with visual contents.
  • Figure 5: Case study: downstream English tasks reveal that Chinese pretraining does not visibly damage the LLM's earlier knowledge.