STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Guohao Sun; Can Qin; Huazhu Fu; Linwei Wang; Zhiqiang Tao

TL;DR

STLLaVA-Med (Self-Training Large Language and Vision Assistant for Medicine) trains a policy model to auto-generate medical visual instruction data, improving data efficiency, with training guided by Direct Preference Optimization (DPO).

Abstract

Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, progress in medical image understanding and reasoning depends critically on high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data scarcity, we introduce Self-Training Large Language and Vision Assistant for Medicine (STLLaVA-Med). The proposed method trains a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided by Direct Preference Optimization (DPO). Specifically, a larger and more powerful LVLM (e.g., GPT-4o) serves as a biomedical expert that oversees DPO fine-tuning on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med on three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance while using only 9% of the medical data.
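
To make the DPO objective mentioned in the abstract concrete, the following is a minimal sketch of the standard per-pair DPO loss in plain Python. It is not the authors' implementation: the function names, the `beta` default of 0.1, and the use of summed sequence log-probabilities as inputs are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Inputs are log-probabilities of each full answer under the policy
    being trained and under a frozen reference model.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss drops below that as the policy raises the chosen answer's probability relative to the rejected one.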

Paper Structure (16 sections, 2 equations, 6 figures, 4 tables)

Introduction
STLLaVA-Med
Self-training Datasets
Experiments
Implementation
Preference Data Generation
Datasets and Metrics
Overall Performance
Conclusion
Limitations
Ethics Statements
Related Work
Large Vision-Language Model
Experiments
Additional Results
...and 1 more sections

Figures (6)

Figure 1: Left: Comparison of total medical data usage between LLaVA-Med (530K) and STLLaVA-Med (50k). Right: Comparison results on three medical VQA datasets. STLLaVA-Med reports better/comparable performance, using much less medical training data.
Figure 2: Model architecture of STLLaVA-Med and self-training pipeline. Left: stage 1 aiming to optimize the model $\pi_\theta$ improving medical image reasoning and learning to question. Right: in stage 2, we first prompt $\pi_\theta$ to auto-generate preference data under the guidance of GPT-4o, then supervise $\pi_\theta$ for DPO fine-tuning.
Figure 3: Prompt for GPT-4o to grade the answers generated by STLLaVA-Med from stage 1. The answer with the higher score will be designated as the winning response, while the other will be classified as rejected.
Figure 4: Qualitative evaluation of methods w and w/o preference revelation.
Figure 5: Qualitative evaluation of methods w and w/o preference revelation.
...and 1 more figures
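
The selection rule described in the Figure 3 caption (the higher-scored answer becomes the winning response, the other the rejected one) can be sketched as below. This is an illustration only: `PreferencePair`, `build_pair`, and the choice to skip tied scores are assumptions, since the caption does not say how ties are handled.

```python
from typing import NamedTuple, Optional


class PreferencePair(NamedTuple):
    """One DPO training example (hypothetical container)."""
    question: str
    chosen: str
    rejected: str


def build_pair(
    question: str,
    answer_a: str,
    score_a: float,
    answer_b: str,
    score_b: float,
) -> Optional[PreferencePair]:
    """Turn two graded answers into a (chosen, rejected) pair.

    Scores are assumed to come from an external grader such as GPT-4o.
    """
    if score_a == score_b:
        # Assumption: a tie carries no preference signal, so skip it.
        return None
    if score_a > score_b:
        return PreferencePair(question, answer_a, answer_b)
    return PreferencePair(question, answer_b, answer_a)
```

For example, `build_pair("What does the scan show?", "A", 7, "B", 4)` yields a pair with `chosen="A"` and `rejected="B"`.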
