Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding

Shenghuan Sun; Alexander Schubert; Gregory M. Goldgof; Zhiqing Sun; Thomas Hartvigsen; Atul J. Butte; Ahmed Alaa

TL;DR

A new alignment algorithm is proposed that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge and eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback.

Abstract

Vision-Language Models (VLMs) can support clinicians by analyzing medical images and engaging in natural language interactions to assist in diagnostic and treatment tasks. However, VLMs often exhibit "hallucinogenic" behavior, generating textual outputs that are not grounded in the contextual multimodal information. This challenge is particularly pronounced in the medical domain, where VLM outputs must not only be accurate in single interactions but also remain consistent with clinical reasoning and diagnostic pathways throughout multi-turn conversations. For this purpose, we propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge. These representations are used to (i) generate GPT-4-guided visual instruction tuning data at scale, simulating clinician-VLM conversations with demonstrations of clinical reasoning, and (ii) create an automatic reward function that evaluates the clinical validity of VLM generations throughout clinician-VLM interactions. Our algorithm eliminates the need for human involvement in training data generation or reward model construction, reducing costs compared to standard reinforcement learning with human feedback (RLHF). We apply our alignment algorithm to develop Dr-LLaVA, a conversational VLM finetuned for analyzing bone marrow pathology slides, demonstrating strong performance in multi-turn medical conversations.
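To make the idea of an automatic, symbolic reward more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a multi-turn conversation reduces to a fixed sequence of categorical answers, and that the symbolic clinical rules can be written as a set of valid answer chains. The names `VALID_PATHS`, `symbolic_reward`, and `consistency_weight`, as well as all answer labels, are hypothetical placeholders chosen for illustration.

```python
from typing import Sequence

# Hypothetical symbolic rules: each tuple is one clinically valid chain of
# answers to a fixed question sequence (image quality -> finding -> diagnosis).
# The labels are illustrative placeholders, not the paper's actual rule set.
VALID_PATHS = {
    ("adequate", "abnormal", "disease"),
    ("adequate", "normal", "no disease"),
    ("inadequate", "not assessable", "no diagnosis possible"),
}


def symbolic_reward(
    predicted: Sequence[str],
    reference: Sequence[str],
    consistency_weight: float = 1.0,  # hypothetical trade-off parameter
) -> float:
    """Score one multi-turn conversation without a human or learned judge.

    The score combines two terms:
    - correctness: fraction of turns whose answer matches the reference label
    - consistency: 1.0 if the whole answer chain is a valid symbolic path
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must be non-empty and equal in length")

    matches = sum(p == r for p, r in zip(predicted, reference))
    correctness = matches / len(reference)
    consistency = 1.0 if tuple(predicted) in VALID_PATHS else 0.0
    return correctness + consistency_weight * consistency
```

In this sketch, a conversation whose answers are individually plausible but contradict each other (for example, a "normal" finding followed by a "disease" diagnosis) receives no consistency credit, which is the kind of multi-turn behavior the abstract says the reward is meant to discourage. How the two terms are actually weighted and combined in Dr-LLaVA is not stated in this summary.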

Paper Structure (20 sections, 3 equations, 4 figures, 5 tables)

Introduction
Visual Instruction Tuning with Symbolic Clinical Grounding
Results
Conclusion
Data
Additional Medical Context
Generating a multi-turn conversation dataset
Instruction tuning details
Multimodal Supervised Finetuning
VLM response labelling
Training details
Additional Experimental Results
Evaluation metrics
Performance given diverse conversation sequences
Performance in case of misleading clinician hypothesis
...and 5 more sections

Figures (4)

Figure 1: Symbolic representation of clinical reasoning in blood cancer diagnosis.
Figure 2: Pictorial depiction of the Dr-LLaVA training pipeline. (a) Multi-turn conversations consistent with symbolic clinical reasoning are generated for each medical image, utilizing GPT-4 for diverse phrasing. (b) A symbolic reward function evaluates VLM responses, checking individual correctness and clinical validity. (c) Using the dataset from (a) and the reward model from (b), a pretrained VLM is finetuned via RL.
Figure C.1: Impact of the hyperparameter $\lambda$ on Dr-LLaVA performance.
Figure C.2: Example outputs of the Dr-LLaVA model and baselines.
