Generalised Medical Phrase Grounding

Wenjun Zhang, Shekhar S. Chandra, Aaron Nicolson

TL;DR

Generalised Medical Phrase Grounding (GMPG) reframes medical phrase grounding to support zero-to-many grounded regions per phrase and provides confidence scores, addressing limitations of prior MPG. MedGrounder, a DETR-like medical grounding model, uses a two-stage training regime: weakly supervised pretraining on Chest ImaGenome and fine-tuning on expert PadChest-GR and MS-CXR data, achieving strong zero-shot transfer and state-of-the-art results on multi-region grounding. The approach enables modular grounded report generation by pairing with existing radiology generators, producing grounded reports without retraining generators, and demonstrates substantial gains especially for multi-region and non-groundable phrases. Limitations remain for spatially diffuse findings and small datasets, motivating further data and priors to improve spatial generalisation and broader clinical deployment.

Abstract

Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence--anatomy box alignment datasets and fine-tuning on report sentence--human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.
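
The difference between the REC-style output (exactly one box per phrase) and the GMPG output (zero, one, or multiple scored regions) can be illustrated with a minimal sketch. This is not the MedGrounder implementation: the `ScoredBox` type, the function names, the 0.5 threshold, and the example boxes and scores are all hypothetical and chosen only to make the output contract concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max), normalised to [0, 1]; an assumed convention.
Box = Tuple[float, float, float, float]


@dataclass
class ScoredBox:
    box: Box
    score: float  # confidence assigned to this candidate region


def rec_style(candidates: List[ScoredBox]) -> ScoredBox:
    """REC-style grounding: always return exactly one box, the top-scoring one."""
    return max(candidates, key=lambda c: c.score)


def gmpg_style(candidates: List[ScoredBox], threshold: float = 0.5) -> List[ScoredBox]:
    """GMPG-style grounding: keep every candidate at or above the threshold.

    The result may be empty (non-groundable phrase), a single box,
    or several boxes (multi-region finding), sorted by descending score.
    """
    kept = [c for c in candidates if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Hypothetical candidates for a multi-region phrase such as
# "bilateral pleural effusions": two confident regions, one spurious.
multi_region = [
    ScoredBox((0.60, 0.55, 0.90, 0.90), 0.87),
    ScoredBox((0.10, 0.55, 0.40, 0.90), 0.91),
    ScoredBox((0.30, 0.10, 0.50, 0.30), 0.12),
]

# Hypothetical candidates for a negation such as "no pneumothorax":
# every candidate receives a low score.
negation = [
    ScoredBox((0.20, 0.10, 0.45, 0.40), 0.08),
    ScoredBox((0.55, 0.10, 0.80, 0.40), 0.03),
]

multi_result = gmpg_style(multi_region)  # two boxes survive the threshold
negation_result = gmpg_style(negation)   # empty list: nothing is grounded
forced_box = rec_style(negation)         # REC-style still returns one box
```

In this toy setting the REC-style function is forced to return a low-confidence box for the negated phrase, whereas the thresholded variant returns nothing, which is the behaviour the GMPG formulation asks for. How the threshold is chosen in practice is left open here.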

Paper Structure

This paper contains 34 sections, 3 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Top: MPG predicts a single bounding box per sentence, while GMPG additionally supports (i) multiple boxes per sentence, (ii) suppression of boxes for non-groundable phrases (e.g., negations), and (iii) confidence scoring for all predictions. Bottom: Applications of GMPG: (A) grounding a radiologist-written report for patient comprehension, and (B) grounding an AI-generated report for radiologist verification.
  • Figure 2: MedGrounder architecture overview. ResNet-101 and BioClinical ModernBERT encode image and phrase features, respectively, which are concatenated and fed into a cross-encoder. From the features of the cross-encoder, a Transformer decoder then predicts a set of scored bounding boxes corresponding to the phrase, with an example shown on the right.
  • Figure 3: LLM prompt for filtering redundant anatomical regions. The model selects the most specific regions from candidates, discarding broader parent regions.
  • Figure 4: Qualitative comparison of grounding results from TransVG, MedRPG, MAIRA-2 GR, and the proposed MedGrounder on three PadChest-GR examples. Ground-truth (GT) is shown in the first column, model predictions in the remaining columns. Numbers in brackets after each sentence indicate the number of bounding boxes in the GT.
  • Figure 5: Correlation heatmap on MS-CXR. Performance vs. data properties; cell text shows Pearson $r$ and $p$-values. Asterisks indicate statistical significance levels: ** $p<0.01$ and * $p<0.05$.
  • ...and 2 more figures