Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models

Konstantinos Vilouras, Pedro Sanchez, Alison Q. O'Neil, Sotirios A. Tsaftaris

TL;DR

The paper presents a zero-shot approach to medical phrase grounding that exploits the cross-attention layers of a frozen Latent Diffusion Model, avoiding any task-specific fine-tuning. By performing DDIM inversion and aggregating cross-attention maps from middle layers and timesteps, it generates heatmaps that localize pathologies described in radiology reports on chest X-ray images. On the MS-CXR dataset, the method is competitive with the state of the art, outperforming some baselines and approaching others, while strictly adhering to a zero-shot setting. The approach highlights the potential of off-the-shelf foundation models for medical localization tasks and discusses practical considerations such as computational cost and robustness across pathologies. The authors provide ablations and qualitative analyses to explain their design choices and propose avenues for future improvement, including few-shot fine-tuning and faster sampling.

Abstract

Localizing the exact pathological regions in a given medical scan is an important imaging problem that traditionally requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any training on the target task, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance at https://github.com/vios-s.

Paper Structure

This paper contains 24 sections, 5 equations, 3 figures, 3 tables, and 1 algorithm.

Figures (3)

  • Figure 1: High-level description of the zero-shot phrase grounding task. Given input pairs of an image (chest X-ray) and its accompanying text prompt, we leverage cross-modal feature alignment mechanisms within a frozen Latent Diffusion Model (LDM) to extract heatmaps, which indicate the regions where image and text are maximally aligned. Then, we evaluate the generated heatmaps based on ground truth bounding boxes (shown in green) for pathology detection. Our method, thus, is an illustration of using pre-trained LDMs for downstream applications in a zero-shot setting.
  • Figure 2: Overview of our proposed phrase grounding pipeline based on the Latent Diffusion Model (Rombach et al., 2022). The input image-text pair is first processed via the encoders $E$ and $\boldsymbol{\tau_\theta}$, respectively. Then, at each timestep of the diffusion process $t=1,\dots,T$, we gather cross-attention maps from the U-Net $\boldsymbol{\epsilon_\theta}$. The output heatmap $\mathbf{h}$ is generated by averaging the gathered attention maps.
  • Figure 3: Randomly selected results for the phrase grounding task. For each input image-prompt pair, we show the heatmaps generated from BioViL (Boecking et al., 2022), BioViL-T (Bannur et al., 2023) and our own method, respectively, overlaid on the original images. Ground truth classes are highlighted in bold within each prompt. Ground truth bounding boxes are depicted in green. For each method, we also provide the reported $|\text{CNR}|$ and mIoU metrics (shown on top of each figure). Best viewed in colour.
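
The averaging step described in the Figure 2 caption, and the box-based evaluation mentioned in Figures 1 and 3, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the cross-attention maps have already been extracted from the U-Net and averaged over heads and prompt tokens. The dictionary layout, the function names, the nearest-neighbour upsampling, the min-max normalisation, the single 0.5 threshold, and the $|\text{CNR}|$ formula (absolute difference of mean activation inside and outside the box, divided by the square root of the summed variances) are all assumptions made for illustration. The input maps are synthetic.

```python
import numpy as np


def upsample_nearest(a, size):
    """Nearest-neighbour upsampling of a 2-D map to (size, size)."""
    h, w = a.shape
    if size % h or size % w:
        raise ValueError("size must be a multiple of the map resolution")
    return np.repeat(np.repeat(a, size // h, axis=0), size // w, axis=1)


def aggregate_heatmap(attn_maps, size, timesteps=None, layers=None):
    """Average the selected cross-attention maps into one heatmap in [0, 1].

    attn_maps: dict {(timestep, layer_name): 2-D array}. This layout is
    hypothetical; maps are assumed pre-averaged over heads and tokens.
    """
    selected = [
        upsample_nearest(m, size)
        for (t, layer), m in attn_maps.items()
        if (timesteps is None or t in timesteps)
        and (layers is None or layer in layers)
    ]
    if not selected:
        raise ValueError("no attention maps match the selection")
    h = np.mean(selected, axis=0)
    span = h.max() - h.min()
    return (h - h.min()) / span if span > 0 else np.zeros_like(h)


def box_mask(size, box):
    """Boolean mask for a bounding box given as (x, y, width, height)."""
    x, y, w, h = box
    mask = np.zeros((size, size), dtype=bool)
    mask[y:y + h, x:x + w] = True
    return mask


def iou_at_threshold(heatmap, mask, thr):
    """IoU between the thresholded heatmap and the box mask."""
    pred = heatmap >= thr
    union = np.logical_or(pred, mask).sum()
    return np.logical_and(pred, mask).sum() / union if union else 0.0


def abs_cnr(heatmap, mask):
    """Assumed |CNR|: |mean_in - mean_out| / sqrt(var_in + var_out)."""
    inside, outside = heatmap[mask], heatmap[~mask]
    denom = np.sqrt(inside.var() + outside.var())
    if denom == 0:
        raise ValueError("CNR is undefined when both regions are constant")
    return abs(inside.mean() - outside.mean()) / denom


# Synthetic demo: three timesteps, two layers at different resolutions.
rng = np.random.default_rng(0)
size = 64
mask = box_mask(size, (16, 16, 32, 32))
attn_maps = {}
for t in (1, 2, 3):
    for layer, res in (("mid", 8), ("up", 16)):
        step = size // res
        signal = mask[::step, ::step].astype(float)  # box at low resolution
        attn_maps[(t, layer)] = signal + 0.2 * rng.random((res, res))

heatmap = aggregate_heatmap(attn_maps, size)
print("IoU@0.5:", iou_at_threshold(heatmap, mask, 0.5))
print("|CNR|:", abs_cnr(heatmap, mask))
```

Passing `timesteps` or `layers` restricts the average to a subset of maps, which mirrors the idea of selecting middle layers and timesteps; which subset works best is an empirical question that this sketch does not answer.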