RadEdit: stress-testing biomedical vision models via diffusion image editing

Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Daniel C. Castro, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay, Maximilian Ilse

TL;DR

RadEdit addresses the problem of biased and small biomedical imaging datasets by enabling stress-testing of vision models against realistic dataset shifts. It introduces a diffusion-based editing approach that uses keep and edit masks to constrain changes and maintain anatomical consistency, reducing artefacts common in prior methods. The approach is validated across acquisition, manifestation, and population shifts, demonstrating its ability to reveal robustness gaps in classification and segmentation models and to quantify biases without additional data collection. Through the BioViL-T editing score and synthetic test sets, RadEdit provides a practical tool for pre-deployment evaluation, potentially reducing diagnostic errors and patient harm while complementing explainable AI techniques.

Abstract

Biomedical imaging datasets are often small and biased, meaning that real-world performance of predictive models can be substantially lower than expected from internal testing. This work proposes using generative image editing to simulate dataset shifts and diagnose failure modes of biomedical vision models; this can be used in advance of deployment to assess readiness, potentially reducing cost and patient harm. Existing editing methods can produce undesirable changes, with spurious correlations learned due to the co-occurrence of disease and treatment interventions, limiting practical applicability. To address this, we train a text-to-image diffusion model on multiple chest X-ray datasets and introduce a new editing method RadEdit that uses multiple masks, if present, to constrain changes and ensure consistency in the edited images. We consider three types of dataset shifts: acquisition shift, manifestation shift, and population shift, and demonstrate that our approach can diagnose failures and quantify model robustness without additional data collection, complementing more qualitative tools for explainable AI.
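
The idea of using multiple masks to constrain an edit can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `masked_edit_step`, its arguments, and the exact blending order are hypothetical. It only shows how an edit mask and a keep mask could be combined at a single denoising step, so that prompt-guided changes stay inside the edit region and the keep region is restored from the original image.

```python
import numpy as np


def masked_edit_step(edited, unedited, original_noised, edit_mask, keep_mask):
    """Blend three candidate images for one denoising step (hypothetical helper).

    edited:          prediction guided by the edit prompt
    unedited:        prediction without edit guidance
    original_noised: original image at the matching noise level
    edit_mask, keep_mask: binary arrays, 1 inside the region, 0 outside
    """
    edit_mask = edit_mask.astype(edited.dtype)
    keep_mask = keep_mask.astype(edited.dtype)
    # Assumption: a pixel cannot be both edited and kept.
    if np.any(edit_mask * keep_mask):
        raise ValueError("edit and keep masks must not overlap")
    # Apply the prompt-guided change only inside the edit mask.
    blended = edit_mask * edited + (1.0 - edit_mask) * unedited
    # Restore the original content inside the keep mask.
    return keep_mask * original_noised + (1.0 - keep_mask) * blended
```

In this sketch, pixels that fall in neither mask take the unguided prediction, so they may still change slightly to stay consistent with the edited region. That is a design assumption of the example rather than a claim about the paper's method.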

Paper Structure (53 sections, 3 equations, 19 figures, 3 tables, 3 algorithms)

Figures (19)

  • Figure 1: Stress-testing models by simulating dataset shifts via image editing. Top: editing out COVID-19 features results in false positives since the classifier relies on acquisition differences, e.g., radiographic markers (white arrow). Middle: editing out a pneumothorax (PTX) results in false positives since the classifier instead detects chest drains. Bottom: editing abnormalities into lungs causes a lung segmentation model to mislabel (blue: ground-truth segmentation; red: model prediction).
  • Figure 1: Quantifying robustness of COVID-19 detectors to acquisition shift. We train a weak predictor on the 'Biased' dataset, a combination of BIMCV+ [vaya_bimcv_2020] and MIMIC-CXR [johnson_mimic-cxr_2019], and a strong predictor on an unbiased dataset, a combination of BIMCV+ and BIMCV-. The 'Synthetic' test set consists of 2774 COVID-19-negative images with the same spurious features as the BIMCV+ datasets, e.g., laterality markers. We report mean accuracy and standard deviation across 5 runs.
  • Figure 2: Removing COVID-19 features with LANCE (b) also changes the laterality markers and reduces contrast. In contrast, RadEdit (c; ours) preserves anatomical structures and laterality markers, and retains the original contrast.
  • Figure 2: Quantifying robustness of pneumothorax detectors to manifestation shift. The weak predictor is trained on the biased CANDID-PTX [feng_curation_2021] dataset to classify pneumothorax; the strong predictor is trained on SIIM-ACR [siim-acr-pneumothorax-segmentation] to classify and segment the pneumothorax. Real 'Biased' test data comes from CANDID-PTX, which exhibits strong confounding between pneumothorax and chest tubes; 'Synthetic' test data consists solely of 629 edited images containing chest drains but no pneumothorax. We report mean accuracy and standard deviation across 5 runs.
  • Figure 3: Removing pneumothorax (red) with LANCE (b) also removes the spuriously correlated chest drain (blue) and reduces contrast. In contrast, RadEdit (c; ours) preserves the chest drain and better preserves anatomical structures.
  • ...and 14 more figures