Biomedical Visual Instruction Tuning with Clinician Preference Alignment

Hejie Cui; Lingjun Mao; Xin Liang; Jieyu Zhang; Hui Ren; Quanzheng Li; Xiang Li; Carl Yang

TL;DR

BioMed-VITAL tackles the scarcity and misalignment of domain-specific instruction data for biomedical vision-language models by introducing a data-centric pipeline guided by clinician preferences. It combines data generation guided by diverse clinician-selected demonstrations with a mixed-preference data selection model that distills clinician and model judgments into high-quality instruction data. Fine-tuning a LLaVA-based biomedical model on the distilled data yields substantial gains on open-ended visual chat and biomedical VQA benchmarks, with win rates of up to 81.73%. By releasing 80K clinician-aligned instruction datasets and associated models, the work provides a practical pathway for deploying clinician-aware multimodal models in real-world biomedical settings.

Abstract

Recent advances in multimodal foundation models have demonstrated impressive capabilities in understanding and reasoning over visual and textual information. Adapting these general-purpose foundation models to specialized domains such as biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resulting datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both the generation and the selection of instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations to generate preference-aligned data candidates. Then, during the selection stage, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method achieves significant improvements in open visual chat (18.5% relative improvement) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at BioMed-VITAL.github.io.
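
The selection step described in the abstract can be illustrated with a minimal sketch: score each generated candidate with a rating function, then keep the highest-rated fraction. This is only an illustration under stated assumptions, not the paper's implementation. In BioMed-VITAL the rating function is a trained selection model; here `rate` is a stand-in callable, and `select_top_fraction`, `keep_fraction`, and the toy `score` field are hypothetical names introduced for this example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    candidates: Sequence[T],
    rate: Callable[[T], float],
    keep_fraction: float,
) -> List[T]:
    """Return the highest-rated `keep_fraction` of `candidates`.

    `rate` is a placeholder for a learned rating function; any callable
    mapping a candidate to a float works for this sketch.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not candidates:
        return []
    # Keep at least one sample so a small pool never yields an empty set.
    n_keep = max(1, int(len(candidates) * keep_fraction))
    # Rank from highest to lowest rating, then truncate.
    ranked = sorted(candidates, key=rate, reverse=True)
    return ranked[:n_keep]


# Toy candidates with made-up scores (not data from the paper).
toy_candidates = [
    {"id": "a", "score": 0.2},
    {"id": "b", "score": 0.7},
    {"id": "c", "score": 0.9},
    {"id": "d", "score": 0.4},
]

selected = select_top_fraction(toy_candidates, lambda c: c["score"], 0.5)
selected_ids = [c["id"] for c in selected]
```

With the toy scores above, keeping the top half retains candidates `c` and `b`. Varying `keep_fraction` corresponds to choosing different top-K percentiles of the ranked data.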

Paper Structure (27 sections, 4 equations, 11 figures, 4 tables)

Introduction
Background
Clinician-Aligned Biomedical Visual Instruction Tuning
Stage 1: Data Generation with Diverse Expert-Selected Demonstration
Stage 2: Distilling Mixed Clinician Preference for Data Selection
Stage 3: Instruction-Tuning
Experiments
Dataset and Experiment Details of BioMed-VITAL
Alignment Evaluation of the Data Selection Model
Downstream Evaluation 1: Open-Ended Medical Visual Chat
Downstream Evaluation 2: Performance on Established VQA Benchmarks
Case Study
Conclusion and Discussion
Clinician Preference Annotation
Prompt for Instructional Data Generation
...and 12 more sections

Figures (11)

Figure 1: Overview of Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL). Clinician preferences are infused in (1) the data generation stage and (2) the data selection stage.
Figure 2: Left: Comparison of human preference alignment between GPT-4V and our selection model. Right: F1 and precision for varying top K percentile samples ranked by the selection model.
Figure 3: Win rate performance of BioMed-VITAL and its variants compared with LLaVA-Med.
Figure 4: Case study on the generated instruction-following data.
Figure 5: Case study for the downstream task of open-ended visual chat.
...and 6 more figures
