Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu

Abstract

Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases in free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to the absence of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention module that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
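As a concrete illustration of the global-local attention idea described above, the following is a minimal PyTorch sketch of how a coarse global cue and a refined local cue could be fused into a single attention map over visual tokens to guide a box head. The module name, the two query projections, the sigmoid gate, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAttention(nn.Module):
    """Hedged sketch of global-local attention fusion for grounding.

    Assumptions (not from the paper): the phrase/knowledge prompt is a
    pooled embedding, global and local cues come from two learned query
    projections, and a sigmoid gate balances the two attention maps.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)  # coarse, image-level query
        self.local_proj = nn.Linear(dim, dim)   # refined, region-level query
        self.gate = nn.Linear(2 * dim, 1)       # learned global/local balance

    def forward(self, phrase_emb: torch.Tensor, visual_tokens: torch.Tensor):
        # phrase_emb: (B, D) pooled phrase/knowledge embedding
        # visual_tokens: (B, N, D) patch features from the vision backbone
        scale = visual_tokens.size(-1) ** 0.5
        g = self.global_proj(phrase_emb).unsqueeze(1)                    # (B, 1, D)
        l = self.local_proj(phrase_emb).unsqueeze(1)                     # (B, 1, D)
        attn_g = F.softmax((visual_tokens * g).sum(-1) / scale, dim=-1)  # (B, N)
        attn_l = F.softmax((visual_tokens * l).sum(-1) / scale, dim=-1)  # (B, N)
        alpha = torch.sigmoid(self.gate(torch.cat([g, l], dim=-1)))      # (B, 1, 1)
        attn = alpha.squeeze(-1) * attn_g + (1 - alpha.squeeze(-1)) * attn_l
        # Attention-weighted visual summary that a box head could decode from.
        summary = (attn.unsqueeze(-1) * visual_tokens).sum(dim=1)        # (B, D)
        return summary, attn
```

The gate lets the model lean on the global map when the phrase describes diffuse findings and on the local map when it names a focal lesion; this is one plausible way to realize the paper's stated combination of coarse and refined cues.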

Paper Structure

This paper contains 34 sections, 18 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Motivation and architectural comparison for medical visual grounding. (a) Existing VLM-based MVG relies on latent token prompts with unstable attention. (b) Our knowledge-enhanced prompting strategy injects phrase-level priors for improved grounding ability. (c) Global-local attention module combines semantic masks and box cues to achieve accurate and consistent grounding.
  • Figure 2: Overall framework of the proposed medical report grounding method. Given a medical image and its corresponding clinical text, a vision backbone and a multimodal LLM jointly encode visual and textual representations. To enhance spatial grounding, we introduce a knowledge-enhanced prompting strategy that encodes phrase-related anatomical information and selects top-k location prompts. These priors are combined with a global-local attention module to guide the BOX decoder toward clinically relevant regions (a hedged sketch of the top-k selection step follows this list).
  • Figure 3: Visual comparison of medical report grounding results across different methods on the MRG-MS-CXR dataset.
  • Figure 4: Visual comparison of medical report grounding results across different methods on the MRG-ChestX-ray8 dataset.
  • Figure 5: Visual comparison results across different methods under VQA-based and category-level clinical text inputs.
  • ...and 1 more figure
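The Figure 2 caption mentions selecting top-k location prompts from phrase-related anatomical knowledge. Below is a hedged sketch of what such a selection step could look like: a learned bank of location-prompt embeddings is ranked by cosine similarity against the phrase embedding, and the k best matches are returned as extra prompt tokens. The bank, the similarity measure, and k=3 are all illustrative assumptions rather than the paper's actual design.

```python
import torch
import torch.nn.functional as F

def select_location_prompts(phrase_emb: torch.Tensor,
                            prompt_bank: torch.Tensor,
                            k: int = 3) -> torch.Tensor:
    """Hypothetical top-k location-prompt selection.

    phrase_emb:  (B, D) phrase/knowledge embedding.
    prompt_bank: (M, D) learned embeddings of anatomical location prompts
                 (bank contents and size are assumptions).
    Returns (B, k, D) prompt tokens to prepend to the decoder input.
    """
    # Cosine similarity between each phrase and every bank entry.
    sim = F.normalize(phrase_emb, dim=-1) @ F.normalize(prompt_bank, dim=-1).T  # (B, M)
    topk = sim.topk(k, dim=-1).indices                                          # (B, k)
    return prompt_bank[topk]                                                    # (B, k, D)
```

In this reading, the selected prompts act as compact spatial priors: instead of asking the LLM to reason about anatomy in text, the decoder receives a few embedding-level hints about where the described finding is likely to appear.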