Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

Abstract

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.

Paper Structure (26 sections, 4 equations, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: Conceptual illustrations of (a) instruction tuning, which fine-tunes the model on curated instructions and outputs, and (b) instruction-free tuning, which fine-tunes the model solely on paired textual descriptions (e.g., radiology reports). Instruction tuning requires instruction-image-output triplets constructed by humans or LLMs prior to fine-tuning, whereas instruction-free tuning can be performed on image-description pairs without additional steps. Examples of an image, an instruction-output pair for instruction tuning, and a description used for instruction-free tuning are shown in (c).
  • Figure 2: An illustration of our instruction-free tuning framework. The vision encoder $g$ extracts key-value matrices ${KV}_v$ from an image $X_v$, which are then integrated into a prompt $X_p$ for the language model $f$ to generate the response $\hat{y}$. For instruction-free tuning, the text instruction in $X_p$ is replaced with the momentum proxy instruction $\bar{t}$. During supervised fine-tuning, $g$ is updated with the autoregressive loss $L$ between the response $\hat{y}$ and the ground truth $y$. In parallel, $\bar{t}$ (initialized during warm-up) is gradually updated via an exponential moving average of the proxy instruction $t$. During inference, $\bar{t}$ is discarded, and a conversational text instruction (e.g., "Describe...") is used to generate a natural language response (e.g., "The image depicts...").
  • Figure 3: Examples of the medical report for the (a) SKINCON, (b) WBCAtt, (c) CBIS, and (d) MIMIC-CXR datasets.
  • Figure 4: Examples of multiple-choice VQA for the (a) SKINCON, (b) WBCAtt, (c) CBIS, and (d) MIMIC-CXR datasets.
  • Figure 5: Radar charts of the accuracy for each attribute (or question type) across the (a) WBCAtt, (b) CBIS, and (c) MIMIC-CXR datasets. LLaMA-3.2-11B-Vision (w/o FT) is shown in black, LLaMA-3.2-11B-Vision (FT) in green, InstFree in blue, InstFree w/ RS in red, and MedGemma-4B in yellow-green.
  • ...and 3 more figures
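
The Figure 2 caption says the momentum proxy instruction $\bar{t}$ is updated via an exponential moving average of the proxy instruction $t$. The snippet below is a minimal sketch of a generic EMA update of that kind, not the authors' implementation. It assumes both instructions are flat lists of floats, and the names `ema_update` and `momentum` and the coefficient values are hypothetical; the paper's actual tensor shapes, momentum value, and warm-up schedule are not given in this section.

```python
def ema_update(t_bar, t, momentum=0.99):
    """Return an EMA-updated momentum proxy instruction.

    t_bar:    current momentum proxy instruction (list of floats)
    t:        current proxy instruction (list of floats)
    momentum: hypothetical smoothing coefficient in [0, 1); not taken from the paper
    """
    if len(t_bar) != len(t):
        raise ValueError("t_bar and t must have the same length")
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must be in [0, 1)")
    # Standard EMA: keep `momentum` of the old value, blend in the rest from t.
    return [momentum * old + (1.0 - momentum) * new for old, new in zip(t_bar, t)]


# Toy usage: with a fixed t, t_bar moves toward t a fraction at a time.
t_bar = [0.0, 0.0]
t = [1.0, -1.0]
for _ in range(3):
    t_bar = ema_update(t_bar, t, momentum=0.5)
```

With a momentum close to 1, $\bar{t}$ changes slowly even when $t$ changes quickly, which is the usual reason for choosing an EMA over using $t$ directly.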