VALD-MD: Visual Attribution via Latent Diffusion for Medical Diagnostics

Ammar A. Siddiqui, Santosh Tirunagari, Tehseen Zia, David Windridge

TL;DR

VALD-MD introduces a diffusion-based visual attribution (VA) framework for medical imaging that generates healthy counterfactuals to reveal diagnostically relevant regions. By coupling latent diffusion with a domain-adapted language model (RadBERT) and image priors, it produces medically grounded counterfactuals conditioned on natural-language prompts, enabling subtractive attribution maps $M(I^a)=I^a-I^n$, where $I^a$ is the abnormal input image and $I^n$ is its generated normal counterpart. The approach is evaluated on chest X-ray datasets using FID, SSIM, and MS-SSIM, demonstrating plausible counterfactuals and robust VA maps, along with zero-shot localized disease induction. The work highlights modest data requirements and potential for open-ended diagnostic exploration, offering a path toward interpretable, controllable visual explanations in clinical practice.
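The subtractive map $M(I^a)=I^a-I^n$ can be illustrated with a minimal NumPy sketch. The function name `attribution_map` and the toy 4x4 arrays are hypothetical and stand in for a real scan and its generated counterfactual; this is only the final subtraction step, not the generative model itself.

```python
import numpy as np

def attribution_map(abnormal, counterfactual):
    """Subtractive attribution map M(I^a) = I^a - I^n (hypothetical helper)."""
    abnormal = np.asarray(abnormal, dtype=np.float64)
    counterfactual = np.asarray(counterfactual, dtype=np.float64)
    if abnormal.shape != counterfactual.shape:
        raise ValueError("images must have the same shape")
    return abnormal - counterfactual

# Toy data (assumption): a flat "healthy" image and a copy with a bright
# 2x2 patch standing in for an opacity.
healthy = np.full((4, 4), 0.2)
abnormal = healthy.copy()
abnormal[1:3, 1:3] += 0.5

va_map = attribution_map(abnormal, healthy)
```

In this toy case the map is nonzero only where the two images differ, which is the property that makes the difference usable as a localization signal.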

Abstract

Visual attribution in medical imaging seeks to make evident the diagnostically relevant components of a medical image, in contrast to the more common detection of diseased tissue deployed in standard machine vision pipelines (which are less straightforwardly interpretable/explainable to clinicians). We present a novel generative visual attribution technique that leverages latent diffusion models in combination with domain-specific large language models to generate normal counterparts of abnormal images. The discrepancy between the two gives rise to a mapping indicating the diagnostically relevant image components. To achieve this, we deploy image priors in conjunction with appropriate conditioning mechanisms to control the image generative process, including natural language text prompts acquired from medical science and applied radiology. We perform experiments and quantitatively evaluate our results on the COVID-19 Radiography Database, which contains labelled chest X-rays with differing pathologies, via the Fréchet Inception Distance (FID), Structural Similarity (SSIM) and Multi-Scale Structural Similarity (MS-SSIM) metrics obtained between real and generated images. The resulting system also exhibits a range of latent capabilities, including zero-shot localized disease induction, which are evaluated with real examples from the CheXpert dataset.
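As a rough illustration of the SSIM metric named above, the sketch below computes a single-window (global) SSIM between two images. This is a simplification and an assumption: standard SSIM averages the same expression over local sliding windows, and MS-SSIM repeats it across scales. The function name `global_ssim` and the constants `k1`, `k2` (the commonly used defaults) are not taken from the paper.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (simplified, hypothetical helper)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilising constants, scaled by the dynamic range of the pixel values.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Toy data (assumption): a random image, compared with itself and its inverse.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
same_score = global_ssim(img, img)
inverted_score = global_ssim(img, 1.0 - img)
```

An image compared with itself scores 1, while structurally dissimilar images score lower, which is why higher SSIM between real and generated scans is read as better structural fidelity.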

Paper Structure

This paper contains 22 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The counterfactual generation pipeline takes as input the starting abnormal image $x^a$, which is encoded by the VAE encoder ($\epsilon$) to form the encoded image latents $Z$. These are passed through the diffusion process to form the noised latents $Z_T$ after $t$ incremental steps. The fine-tuned conditional U-Net denoises the latents into the conditioned latent $Z$, which is decoded by the VAE decoder $D$ into the final generated counterfactual $x^n$, from which a map $M(x^n)$ is generated explicitly.
  • Figure 2: Healthy counterfactual generation for three cases of lung opacity (red indicates tissue generated by the model)
  • Figure 3: Healthy counterfactual generation (red indicates tissue generated by the model)
  • Figure 4: Zero-shot carcinoma induction with a real example marked by experts
  • Figure 5: Induction of cardiomegaly in real healthy scans
  • ...and 2 more figures
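
The noising step in the Figure 1 caption, where encoded latents $Z$ become noised latents after $t$ steps, can be sketched with the standard closed-form forward diffusion $z_t=\sqrt{\bar\alpha_t}\,z_0+\sqrt{1-\bar\alpha_t}\,\varepsilon$. The linear beta schedule, the number of steps, the latent shape, and the function names below are assumptions for illustration; the paper's actual settings are not given in this summary.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_latents(z0, t, alpha_bars, rng):
    """Sample z_t from z_0 in one shot; returns the noised latent and the noise."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bars[t]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps

# Toy latent (assumption): 4 channels at 8x8, standing in for VAE-encoded latents.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
alpha_bars = make_alpha_bars()
z_t, eps = noise_latents(z0, t=500, alpha_bars=alpha_bars, rng=rng)
```

Choosing an intermediate `t` rather than the final step keeps part of the input structure in `z_t`, which is one common way to let the denoiser edit an image while preserving its overall anatomy.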