Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging
Amar Kumar, Anita Kriz, Barak Pertzov, Tal Arbel
TL;DR
The paper addresses dataset biases and spurious correlations in medical imaging by comparing structural causal models (SCMs) with fine-tuned vision-language foundation models (VLMs) for counterfactual image synthesis. It presents a Stable Diffusion-based approach, enhanced with LANCE and null-text inversion, that generates high-resolution, text-guided counterfactuals, and it evaluates them against SCM-based baselines using perceptual, identity, and causal-effect metrics. The authors show that VLMs can reveal hidden data properties not captured in metadata, while also exposing potential spurious correlations and editing limitations. This work offers a data-driven pathway to audit and improve clinical imaging AI by uncovering latent relationships, with implications for robust and trustworthy deployment.
Abstract
Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' Evaluating our proposed method on a chest X-ray dataset, we show that, according to numerous metrics, these models generate high-resolution images that are edited more precisely than those produced by methods that rely on Structural Causal Models (SCMs). For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured by the granularity of the available metadata and by model capacity limitations. Our experiments demonstrate the potential of these models to reveal underlying dataset properties, while also exposing the limitations of fine-tuned VLMs for accurate image editing and their susceptibility to biases and spurious correlations.
