Anatomy-Aware Conditional Image-Text Retrieval

Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu

TL;DR

This work tackles medical image-text retrieval by incorporating anatomical region conditioning into a multimodal framework. It introduces Region-Relevance-Aligned Vision Language (RRA-VL) with global and region-level alignment and a location-conditioned contrastive loss to enable Location-Conditioned Multimodal Retrieval (LC-MMR). The method achieves state-of-the-art phrase grounding on MS-CXR and competitive, region-aware retrieval on MIMIC-loc and cross-domain datasets, while enabling explainability through prompts to general LLMs without domain-specific text generators. A two-stage training regime leverages weak region-level supervision extracted from radiology reports, supporting precise explanations and preliminary diagnoses aligned with anatomical regions. The approach promises practical clinical benefits by providing fine-grained retrieval, region-level explanations, and region-specific diagnostic prompts for radiology cases.

Abstract

Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases from a database given a query image and/or report, for more efficient clinical diagnosis and treatment, especially for rare diseases. However, conventional ITR systems typically rely only on global image or text representations to measure similarity between patient images and reports, which overlooks local distinctiveness across patient cases. This often results in suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further utilize cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phrase-grounding tasks, and satisfactory multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of our proposed ALC-ITR system in providing explanations and preliminary diagnosis reports given retrieved patient cases (conditioned on anatomical regions), with proper off-the-shelf LLM prompts.
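
To make the idea of location-conditioned retrieval concrete, the following is a minimal sketch, not the authors' implementation. It assumes that some upstream model has already produced one embedding per anatomical region for every image, and it ranks gallery cases by cosine similarity on the queried region only. The function names (`cosine`, `retrieve`), the region labels, and the toy 3-dimensional vectors are all hypothetical and chosen purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_regions, region, gallery, top_k=2):
    """Rank gallery cases by similarity of the embedding for one region.

    query_regions: dict mapping region name -> embedding of the query image
    gallery: dict mapping case id -> dict of region name -> embedding
    Cases without an embedding for the requested region are skipped.
    """
    q = query_regions[region]
    scored = [
        (cosine(q, regions[region]), case_id)
        for case_id, regions in gallery.items()
        if region in regions
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [case_id for _, case_id in scored[:top_k]]


# Toy, made-up embeddings (assumption: real ones would come from a trained model).
gallery = {
    "case_a": {"heart": [1.0, 0.0, 0.0], "left_lung": [0.0, 1.0, 0.0]},
    "case_b": {"heart": [0.0, 1.0, 0.0], "left_lung": [1.0, 0.0, 0.0]},
    "case_c": {"left_lung": [0.9, 0.1, 0.0]},
}
query = {"heart": [0.9, 0.1, 0.0], "left_lung": [0.9, 0.1, 0.0]}

heart_hits = retrieve(query, "heart", gallery)
lung_hits = retrieve(query, "left_lung", gallery)
```

In this toy setup the same query image returns different neighbours depending on the conditioning region, which is the behaviour a purely global image embedding cannot express.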

Paper Structure

This paper contains 22 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Proposed multimodal retrieval system (ALC-ITR) with anatomical region conditioning. Given a query chest X-ray image and a suspicious anatomical region, our proposed system retrieves the most relevant patient cases showing the same or similar disease, for the same anatomical regions. The system then prompts LLMs based on retrieval results for explanation generation and preliminary diagnosis.
  • Figure 2: Proposed Region-Relevance-Aligned medical Vision Language (RRA-VL) model with global and local alignment.
  • Figure 3: Proposed explanation generation pipeline based on location-conditioned multi-modal retrieval. The upper part shows the retrieved patient gallery given the query image conditioned on the "Heart" region. The bottom part illustrates the explanation generation and evaluation pipeline.
  • Figure 4: Visualization of phrase grounding heatmaps of the proposed RRA-VL on MS-CXR [MSCXR_ECCV22]. Red boxes are ground-truth bounding boxes.
  • Figure 5: Statistics of rated consistency score (GPT-4o-mini) of generated explanations and GT descriptions, conditioned at different anatomical regions. Evaluation protocol is described in Figure 3.
  • ...and 5 more figures