A Refer-and-Ground Multimodal Large Language Model for Biomedicine

Xiaoshuang Huang; Haifeng Huang; Lingdong Shen; Yehui Yang; Fangxin Shang; Junwei Liu; Jia Liu

TL;DR

This paper tackles the lack of fine-grained refer-and-ground capabilities in biomedical multimodal models by introducing the Med-GRIT-270k dataset, which uses ChatGPT to convert image-mask pairs from biomedical segmentation datasets into instruction-tuning dialogues across eight imaging modalities. It then presents BiRD, a biomedical refer-and-ground multimodal LLM built on a Qwen-VL backbone and trained in a single stage with multi-task instruction learning, so that it preserves conversational ability while gaining precise region referring and grounding. The key contributions are the first biomedical refer-and-ground dataset and the first model (BiRD) fine-tuned for this capability. Experiments show gains as the training data scales and robust cross-modal interaction, with noted limitations such as object hallucination attributed to the frozen visual encoder. The work aims to advance intelligent biomedical assistants, and the dataset and code are released to support community development and benchmarking.

Abstract

With the rapid development of multimodal large language models (MLLMs), their visual-chat capabilities, especially those built on refer and ground functionalities, are increasingly recognized as significant. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dataset dedicated to the biomedical domain that integrates refer and ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and generate instruction datasets from text using ChatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD), trained with this dataset and multi-task instruction learning. Extensive experiments corroborate the efficacy of the Med-GRIT-270k dataset and the multimodal, fine-grained interactive capabilities of the BiRD model. This work holds significant reference value for the exploration and development of intelligent biomedical assistants.
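As a rough illustration of the key idea in the abstract (turning an image-mask pair into grounded text), the sketch below derives a tight bounding box from a binary mask and wraps it in one question/answer pair. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the `[x_min, y_min, x_max, y_max]` pixel format, the fixed question template, and the "lesion" label are all hypothetical, and the paper uses ChatGPT rather than a template to produce the dialogue text.

```python
from typing import List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels,
# with the origin at the upper left corner of the image.
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight bounding box of nonzero pixels, or None for an empty mask."""
    width = len(mask[0]) if mask else 0
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])


def grounding_pair(label: str, box: Box) -> dict:
    """Build one hypothetical grounding Q/A pair (a template stand-in for ChatGPT)."""
    x0, y0, x1, y1 = box
    return {
        "question": f"Where is the {label} in this image?",
        "answer": f"The {label} is located at [{x0}, {y0}, {x1}, {y1}].",
    }


# Toy 4x5 segmentation mask; 1 marks pixels of the labeled region.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_box(mask)
pair = grounding_pair("lesion", box)
```

The same box could serve the opposite direction (referring), where the box is placed in the question and the answer names or describes the region.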

Paper Structure (8 sections, 5 figures, 2 tables)

Introduction
Related Work
Med-GRIT-270k: Biomedical Ground-and-Refer Instruction-tuning Dataset
Multi-task Instruction Learning
Model Architecture
Multi-task Instruction Training
Experiments
Conclusion

Figures (5)

Figure 1: BiRD empowers multimodal large language models in biomedicine with sophisticated referring and grounding capabilities. For a fairer comparison, we append spatial information to each LLaVA-Med test query, such as "The image size is [w, h], and the origin of the coordinate system is located in the upper left corner of the image.", where w and h denote the image width and height, respectively.
Figure 2: An instance of our generated instruction-following data. Top: the meta information is created by rules from the medical segmentation datasets, and the image caption is generated by ChatGPT. Bottom: the instruction-following data generated by ChatGPT.
Figure 3: Overview. Left: the distribution of conversation turns in the training set (top) and test set (bottom) of the Med-GRIT-270k dataset we collected. Right: the architecture of the biomedical refer-and-ground multimodal large language model (BiRD), which is based on Qwen-VL (Bai et al., 2023). BiRD is trained on 240k samples and evaluated on 30k samples.
Figure 4: The capabilities of various biomedical MLLMs. Note that "Modality" denotes the image modality.
Figure 5: An example of object hallucination in BiRD.
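The spatial hint quoted in the Figure 1 caption can be attached to a test query with a small helper. This is a hedged sketch: the sentence itself comes from the caption, but the function name, the choice to place the hint after the question, and the single-space separator are assumptions.

```python
def with_spatial_hint(question: str, width: int, height: int) -> str:
    """Append the image size and coordinate origin to a query (placement is assumed)."""
    hint = (
        f"The image size is [{width}, {height}], and the origin of the coordinate "
        "system is located in the upper left corner of the image."
    )
    return f"{question} {hint}"


# Hypothetical query for a 512 x 384 image.
prompt = with_spatial_hint("Where is the liver?", 512, 384)
```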
