Anatomy-Aware Conditional Image-Text Retrieval

Meng Zheng; Jiajin Zhang; Benjamin Planche; Zhongpai Gao; Terrence Chen; Ziyan Wu

TL;DR

This work tackles medical image-text retrieval by incorporating anatomical region conditioning into a multimodal framework. It introduces the Relevance-Region-Aligned Vision Language (RRA-VL) model, which combines global and region-level alignment with a location-conditioned contrastive loss to enable Location-Conditioned Multimodal Retrieval (LC-MMR). The method achieves state-of-the-art phrase grounding on MS-CXR and competitive, region-aware retrieval on MIMIC-loc and cross-domain datasets, while enabling explainability through prompts to general LLMs without domain-specific text generators. A two-stage training regime leverages weak region-level supervision extracted from radiology reports, supporting precise explanations and preliminary diagnoses aligned with anatomical regions. The approach targets practical clinical use by providing fine-grained retrieval, region-level explanations, and region-specific diagnostic prompts for radiology cases.

Abstract

Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases from a database given a query image and/or report, for more efficient clinical diagnosis and treatment, especially for rare diseases. However, conventional ITR systems typically rely only on global image or text representations to measure similarity between patient images and reports, which overlook local distinctiveness across patient cases. This often results in suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further utilize cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phrase-grounding tasks, and satisfactory multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of our proposed ALC-ITR system in providing explanations and preliminary diagnosis reports given retrieved patient cases (conditioned on anatomical regions), with proper off-the-shelf LLM prompts.
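To make the idea of location conditioning concrete, here is a minimal sketch of the retrieval step alone. It is not the RRA-VL model or its training procedure. It assumes each stored case already has one embedding per anatomical region, and it ranks cases by cosine similarity within the queried region only. The region names, the toy vectors, and the functions `cosine` and `retrieve` are all hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, region, database, top_k=3):
    """Rank stored cases by similarity of their embedding for `region`.

    Cases with no embedding for the requested region are skipped, so a
    strong match in a different anatomical region cannot be returned.
    """
    scored = []
    for case_id, regions in database.items():
        if region in regions:
            scored.append((cosine(query_vec, regions[region]), case_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case_id for _, case_id in scored[:top_k]]


# Hypothetical toy database: case id -> {region name -> region embedding}.
database = {
    "case_a": {"left_lung": [0.9, 0.1, 0.0], "heart": [0.1, 0.9, 0.0]},
    "case_b": {"left_lung": [0.1, 0.2, 0.9]},
    "case_c": {"heart": [0.9, 0.1, 0.0]},
}

# Query: an embedding of the suspicious region, conditioned on "left_lung".
ranked = retrieve([1.0, 0.0, 0.0], "left_lung", database)
print(ranked)
```

In this toy setup, `case_c` has a heart embedding that closely matches the query vector, but it is excluded because the query is conditioned on the left lung. That is the behavior a purely global similarity measure would not guarantee.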
