Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

Amar Kumar, Anita Kriz, Barak Pertzov, Tal Arbel

TL;DR

The paper addresses dataset biases and spurious correlations in medical imaging by comparing structural causal models with fine-tuned vision-language foundation models for counterfactual image synthesis. It presents a Stable Diffusion-based approach, enhanced with LANCE and null-text inversion, to generate high-resolution, text-guided counterfactuals and evaluates them against SCM-based baselines using perceptual, identity, and causal-effect metrics. The authors show that VLMs can reveal hidden data properties not captured in metadata, while also exposing potential spurious correlations and editing limitations. This work offers a data-driven pathway to audit and improve clinical imaging AI by uncovering latent relationships, with implications for robust and trustworthy deployment.

Abstract

Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' By evaluating our proposed method on a chest X-ray dataset, we show that, according to numerous metrics, these models generate higher-resolution and more precisely edited images than methods that rely on Structural Causal Models (SCMs). For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured by the granularity of available metadata and by model capacity limitations. Our experiments demonstrate the potential of these models to reveal underlying dataset properties, while also exposing the limitations of fine-tuned VLMs for accurate image editing and their susceptibility to biases and spurious correlations.

Paper Structure

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Comparison of counterfactual generation in the SOTA structural causal model (SCM) and a vision-language foundation model. Inference involves generating images with increased age, which may have downstream effects on disease. One set of boxes highlights changes related to disease state, while the other indicates changes associated with age (e.g., decreased bone and tissue density). Because the SCM is built upon an HVAE and cannot generate high-resolution images, it is limited in its ability to produce plausible counterfactuals at finer levels of detail.
  • Figure 2: Comparison of counterfactual image generation results using [Row 1] the proposed method vs. [Row 2] the baseline, a SOTA method that employs an explicit SCM for generation. For the attributes sex and pleural effusion, the proposed method generates counterfactual images by modifying the text prompt, whereas the baseline generates them by performing interventions with the do(.) operator. [Row 3]: Modifications of the age attribute. According to the baseline's pre-defined SCM, there is an edge between age (a in the SCM) and pleural effusion (d), indicating a possible causal effect. Without an explicitly defined SCM, the proposed method also shows an effect on pleural effusion when age is modified. The regions anticipated to undergo changes are indicated in boxes. Note that the SCM method significantly alters the synthesized CF images, resulting in changes to the subject's anatomical structure.
  • Figure 3: Revealing hidden image-attribute relationships from prompt modifications. [Column 1]: Original images with the corresponding prompt "Chest X-ray with cardiomegaly and support devices". [Column 2]: Counterfactual image with the modified text prompt "Chest X-ray with no cardiomegaly". [Column 3]: Counterfactual image with the modified text prompt "Chest X-ray with no support devices". Notably, removing cardiomegaly also results in the specific removal of the pacemaker, but not the other support devices, suggesting a hidden correlation in the training data. This relationship is supported by the literature [koo2017pacing, gul2024pacemaker, khan2023case]. (Differences are best seen when zoomed in.)
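
The prompt modifications described for Figure 3 can be illustrated with a small sketch. It only builds the edited text prompts; it does not perform any image editing and is not the authors' implementation. The helper names `parse_findings` and `removal_prompts`, and the assumption that every prompt follows the pattern "Chest X-ray with A and B", are hypothetical and chosen to match the prompts quoted in the caption.

```python
# Hypothetical helpers: they assume prompts shaped like the ones quoted
# in the Figure 3 caption ("Chest X-ray with <finding> and <finding>").
PREFIX = "Chest X-ray with "


def parse_findings(prompt: str) -> list[str]:
    """Split a prompt of the assumed form into its list of findings."""
    if not prompt.startswith(PREFIX):
        raise ValueError(f"unexpected prompt format: {prompt!r}")
    body = prompt[len(PREFIX):]
    return [part.strip() for part in body.split(" and ") if part.strip()]


def removal_prompts(prompt: str) -> dict[str, str]:
    """Map each finding to a counterfactual prompt stating its absence."""
    return {
        finding: f"{PREFIX}no {finding}"
        for finding in parse_findings(prompt)
    }


if __name__ == "__main__":
    original = "Chest X-ray with cardiomegaly and support devices"
    for finding, edited in removal_prompts(original).items():
        print(f"{finding}: {edited}")
```

Each edited prompt would then be passed, together with the original image, to a text-guided editing model. Comparing the resulting counterfactuals one attribute at a time is what lets an unexpected co-occurring change, such as the pacemaker disappearing along with cardiomegaly, stand out.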