Guiding Medical Vision-Language Models with Explicit Visual Prompts: Framework Design and Comprehensive Exploration of Prompt Variations

Kangyu Zhu, Ziyuan Qin, Huahui Yi, Zekun Jiang, Qicheng Lao, Shaoting Zhang, Kang Li

TL;DR

The paper tackles the challenge of region-focused understanding in medical vision-language models by introducing MedVP, a framework that automatically generates explicit visual prompts to guide attention to clinically relevant regions. It combines medical entity extraction, open-set grounding (Grounding DINO) for ROI localization, and prompt-based image modulation via alpha blending to fine-tune MedVP-LLaVA on multiple medical VQA benchmarks. Key contributions include an automated prompt generation pipeline, a region-level knowledge alignment objective, and strong empirical gains on SLAKE and VQA-RAD, with robust performance under reduced prompts and insightful attention visualizations. The work advances practical medical AI by improving region-specific reasoning and interpretability, with plans to release models, grounding components, and adapted VQA datasets for broader research use.
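The alpha-blending step mentioned above can be illustrated with a short sketch. It assumes the grounding model has already produced a pixel-space bounding box and that the prompt is a rectangular outline. The function name `blend_box_prompt` and its parameters (`color`, `alpha`, `thickness`) are hypothetical and chosen for illustration; they are not taken from the MedVP code.

```python
import numpy as np


def blend_box_prompt(image, box, color=(255, 0, 0), alpha=0.5, thickness=2):
    """Alpha-blend a rectangular outline onto an RGB image (illustrative only).

    image: uint8 array of shape (H, W, 3).
    box:   (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    # Clamp the box to the image so slicing never wraps around.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)

    # Mark the full rectangle, then clear its interior to leave an outline.
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y1, x0:x1] = True
    iy0, iy1 = y0 + thickness, y1 - thickness
    ix0, ix1 = x0 + thickness, x1 - thickness
    if iy1 > iy0 and ix1 > ix0:
        mask[iy0:iy1, ix0:ix1] = False

    # Standard alpha blend: out = alpha * overlay + (1 - alpha) * image.
    out = image.astype(np.float32)  # astype copies, so the input is untouched
    overlay = np.asarray(color, dtype=np.float32)
    out[mask] = alpha * overlay + (1.0 - alpha) * out[mask]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    scan = np.full((64, 64, 3), 100, dtype=np.uint8)  # stand-in for a medical image
    prompted = blend_box_prompt(scan, box=(16, 16, 48, 48))
    print(prompted[16, 16], prompted[32, 32])
```

With `alpha` below 1, the underlying pixels remain partly visible beneath the marker, so the prompt highlights the region without fully occluding the anatomy under the outline.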

Abstract

While mainstream vision-language models (VLMs) have advanced rapidly in understanding image-level information, they still lack the ability to focus on specific areas designated by humans. Rather, they typically rely on large volumes of high-quality image-text paired data to learn and generate posterior attention maps. To address this critical issue, we propose leveraging visual prompts: simple visual markers in various forms that guide and enhance the formation of region-specific attention. Thus, we introduce MedVP, a pioneering framework that integrates medical entity extraction, visual prompt generation, and dataset adaptation for visual-prompt-guided fine-tuning. We successfully outperform recent state-of-the-art large models across multiple medical VQA datasets. Extensive experiments and human evaluation are conducted to analyze the impact of different visual prompt forms and how they contribute to performance improvement. The results demonstrate both the effectiveness and clinical significance of our approach.

Paper Structure (29 sections, 4 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The figure illustrates the difference in the reasoning process of the Vision-Language Model (VLM) with and without visual prompts. By emulating how humans approach vision-language tasks, we use an auxiliary visual prompt, akin to pointing at a specific region, to help the model focus more easily on the relevant details and generate accurate responses.
  • Figure 2: The framework of MedVP: an automated, explicit visual prompt-guided approach for medical vision-language models. The framework first aligns the model with region-level medical knowledge. Additionally, it generates explicit visual prompts by leveraging keywords from the question and visual grounding models, integrating these prompts into medical images to enhance performance on medical VQA tasks.
  • Figure 3: Visualization results of our cross-attention map. The red boxes are the visual prompts generated by our grounding model. As illustrated, the yellow area indicates high attention values and largely overlaps with the visually prompted area.
  • Figure 4: Instructions for inference with the integrated visual prompts.
  • Figure 5: Instructions for generating entities in the entity recognition task.
  • ...and 3 more figures
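The overlap described in the Figure 3 caption, where high attention values largely coincide with the visually prompted area, can also be expressed as a single number. The sketch below is one hypothetical way to do so and is not a metric reported by the paper: it computes the fraction of total attention mass that falls inside the prompt box. It assumes a non-negative 2D attention map that has already been resized to the image resolution; the function name `attention_mass_in_box` is illustrative.

```python
import numpy as np


def attention_mass_in_box(attn, box):
    """Fraction of total attention that lies inside a box (illustrative only).

    attn: non-negative 2D array of shape (H, W), at image resolution.
    box:  (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    """
    attn = np.asarray(attn, dtype=np.float64)
    total = attn.sum()
    if total <= 0:
        return 0.0  # an empty map has no mass to attribute

    h, w = attn.shape
    x0, y0, x1, y1 = box
    # Clamp to the map so negative coordinates do not wrap around.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w, x1), min(h, y1)
    return float(attn[y0:y1, x0:x1].sum() / total)


if __name__ == "__main__":
    attn = np.zeros((8, 8))
    attn[2:6, 2:6] = 1.0  # toy map with attention concentrated in the center
    print(attention_mass_in_box(attn, (2, 2, 6, 6)))
```

A value near 1 means nearly all of the attention falls inside the prompted region, while a value close to the box's share of the image area suggests the prompt had little effect on where attention went.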