An Interpretable Local Editing Model for Counterfactual Medical Image Generation

Hyungi Min; Taeseung You; Hangyeul Lee; Yeongjae Cho; Sungzoon Cho

TL;DR

This paper presents InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing, together with MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs.

Abstract

Counterfactual medical image generation has emerged as a critical tool for enhancing AI-driven systems in the medical domain by answering "what-if" questions. However, existing approaches face two fundamental limitations. First, they fail to prevent unintended modifications, resulting in collateral changes in demographic attributes when only disease features should be affected. Second, they lack interpretability in their editing process, which significantly limits their utility in real-world medical applications. To address these limitations, we present InstructX2X, a novel interpretable local editing model for counterfactual medical image generation featuring Region-Specific Editing. This approach restricts modifications to specific regions, effectively preventing unintended changes while simultaneously providing a Guidance Map that offers inherently interpretable visual explanations of the editing process. Additionally, we introduce MIMIC-EDIT-INSTRUCTION, a dataset for counterfactual medical image generation derived from expert-verified medical VQA pairs. In extensive experiments, InstructX2X achieves state-of-the-art performance across all major evaluation metrics. Our model successfully generates high-quality counterfactual chest X-ray images along with interpretable explanations.

Paper Structure (20 sections, 4 equations, 4 figures, 2 tables)

Introduction
Method
MIMIC-EDIT-INSTRUCTION
Dataset preparation
Instruction Generation
Dataset Statistics
Region-Specific Editing
Experiment
Implementation details
Baselines
Evaluation Metrics
CMIG Score
KL Divergence
Fréchet Inception Distance (FID)
Results
...and 5 more sections

Figures (4)

Figure 1: Comparison of counterfactual medical image generation results between existing methods and our proposed approach. When adding edema features to an input chest X-ray image, existing methods (a–d) demonstrate unintended modifications (red arrows), causing significant variations in age and race (note the demographic predictions below each image). In contrast, InstructX2X preserves the demographic attributes while achieving precise editing and provides a visual explanation via guidance map (red overlay).
Figure 2: Overview of the InstructX2X training framework. The pipeline is adapted from the InstructPix2Pix architecture [Instructpix2pix] and modified to accept longitudinal chest X-ray pairs. The top panel illustrates the dataset construction process, which converts description pairs into the MIMIC-EDIT-INSTRUCTION data. The bottom panel shows the training pipeline, in which the model learns to transform $I_{past}$ into $I_{cur}$ using the constructed instructions.
Figure 3: Overview of our proposed Region-Specific Editing (RSE) mechanism. We apply RSE on top of the inference pipeline of a pre-trained InstructPix2Pix backbone [Instructpix2pix]. The relevance map computation [1] is adapted from prior work [watchyourstep], while the anatomical pseudo masks and their integration with the relevance map to form the guidance map are proposed in this thesis. By restricting edits to instruction-relevant regions, RSE prevents unintended modifications and provides inherent interpretability via the visual guidance map.
Figure 4: Demonstration of fine-grained controllable editing by InstructX2X. The model generates counterfactual images with varying severities (Small vs. Moderate) and precise anatomical locations (Right, Left, Bilateral) for pleural effusion. The guidance maps (red heatmaps) visualize the model's focus, confirming that edits are strictly confined to the instruction-specified regions and intensities.
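
The captions for Figures 3 and 4 describe a guidance map, built from a relevance map and an anatomical pseudo mask, that confines edits to instruction-relevant regions. The sketch below is only a toy illustration of that idea and not the authors' implementation. It assumes that the two maps are combined by an element-wise product followed by a threshold, and that the edit is applied by blending pixels; the source does not specify either step, and the actual method operates inside a diffusion inference pipeline. The function names `guidance_map` and `apply_region_specific_edit` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def guidance_map(relevance, anatomy_mask, threshold=0.5):
    """Combine a relevance map with a binary anatomical pseudo mask.

    Assumption (not from the paper): min-max normalise the relevance map,
    multiply it by the mask, then threshold to obtain a binary guide.
    """
    rel = relevance.astype(np.float64)
    span = rel.max() - rel.min()
    # Guard against a constant relevance map to avoid division by zero.
    rel = (rel - rel.min()) / span if span > 0 else np.zeros_like(rel)
    return ((rel * anatomy_mask) >= threshold).astype(np.float64)


def apply_region_specific_edit(original, edited, guide):
    """Keep edited pixels inside the guide and original pixels elsewhere."""
    return guide * edited + (1.0 - guide) * original


# Toy 4x4 example: relevance grows towards the bottom rows,
# and the pseudo mask covers only the two right-hand columns.
original = np.zeros((4, 4))
edited = np.ones((4, 4))
relevance = np.arange(16, dtype=np.float64).reshape(4, 4)
anatomy_mask = np.zeros((4, 4))
anatomy_mask[:, 2:] = 1.0

guide = guidance_map(relevance, anatomy_mask)
result = apply_region_specific_edit(original, edited, guide)
```

In this toy setting, only pixels that are both highly relevant and inside the pseudo mask are changed, and the same binary guide can be overlaid on the input as a visual explanation of where the edit took place.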
