MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing

Minghao Liu, Zhitao He, Zhiyuan Fan, Qingyun Wang, Yi R. Fung

TL;DR

MedEBench introduces the first benchmark dedicated to text-guided medical image editing, addressing the lack of clinically meaningful evaluation metrics. It constructs a dataset of 1,182 real before/after medical image pairs across 13 anatomical regions and 70 editing tasks, with prompts, ROI masks, and structured change descriptions. The benchmark couples three assessments, Context Preservation (CP), Editing Accuracy (EA), and Visual Quality (VQ), computed via masked SSIM and GPT-4o-based judgments, with attention-grounding diagnostics that identify mislocalization and grounding failures. Across seven baselines, Gemini 2 Flash achieves the best overall performance, fine-tuning shows clear benefits in data-scarce medical domains, and in-context prompting generalizes poorly to pixel-level edits. The results underscore the need for anatomy-aware architectures and medically supervised training to improve reliability and safety in clinical image editing.
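
The masked-SSIM idea behind Context Preservation can be illustrated with a minimal sketch. It assumes that the mask excludes the ROI, so only the surrounding context is compared, and it computes a single global SSIM over those pixels rather than the usual average over local windows. The function name `masked_global_ssim` and the `data_range` parameter are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def masked_global_ssim(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
    roi_mask: Sequence[Sequence[int]],
    data_range: float = 255.0,
) -> float:
    """Global SSIM over pixels OUTSIDE the ROI (simplified, single window).

    Assumption: roi_mask is truthy where the edit is expected, so those
    pixels are skipped and only the context is scored.
    """
    xs, ys = [], []
    for b_row, a_row, m_row in zip(before, after, roi_mask):
        for b, a, m in zip(b_row, a_row, m_row):
            if not m:  # keep context pixels only
                xs.append(float(b))
                ys.append(float(a))
    if not xs:
        raise ValueError("ROI mask covers the whole image; no context left")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

    # Standard SSIM stabilizing constants
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 3x3 grayscale example: the center pixel is the ROI.
before = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
roi = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

# Edit confined to the ROI: context is untouched.
edit_inside = [[10, 20, 30], [40, 255, 60], [70, 80, 90]]
# Edit that leaks into the context (top-left pixel changed).
edit_leaks = [[200, 20, 30], [40, 255, 60], [70, 80, 90]]

score_inside = masked_global_ssim(before, edit_inside, roi)
score_leaks = masked_global_ssim(before, edit_leaks, roi)
```

In this toy case an edit confined to the ROI leaves the score at its maximum, while a change that leaks into the context lowers it. A production implementation would more likely use a windowed SSIM from an image-processing library.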

Abstract

Text-guided image editing has seen significant progress in natural image domains, but its application in medical imaging remains limited and lacks standardized evaluation frameworks. Such editing could revolutionize clinical practices by enabling personalized surgical planning, enhancing medical education, and improving patient communication. To bridge this gap, we introduce MedEBench, a robust benchmark designed to diagnose reliability in text-guided medical image editing. MedEBench consists of 1,182 clinically curated image-prompt pairs covering 70 distinct editing tasks and 13 anatomical regions. It contributes in three key areas: (1) a clinically grounded evaluation framework that measures Editing Accuracy, Context Preservation, and Visual Quality, complemented by detailed descriptions of intended edits and corresponding Region-of-Interest (ROI) masks; (2) a comprehensive comparison of seven state-of-the-art models, revealing consistent patterns of failure; and (3) a diagnostic error analysis technique that leverages attention alignment, using Intersection-over-Union (IoU) between model attention maps and ROI masks to identify mislocalization issues, where models erroneously focus on incorrect anatomical regions. MedEBench sets the stage for developing more reliable and clinically effective text-guided medical image editing tools.
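
The attention-alignment diagnostic in contribution (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the attention map is min-max normalized and binarized at a fixed threshold before the IoU with the ROI mask is taken. The names `binarize` and `attention_roi_iou` and the default threshold of 0.5 are hypothetical.

```python
from typing import List, Sequence


def binarize(
    attention: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[List[bool]]:
    """Min-max normalize an attention map to [0, 1], then threshold it."""
    flat = [v for row in attention for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:  # flat map: nothing stands out
        return [[False for _ in row] for row in attention]
    return [[(v - lo) / span >= threshold for v in row] for row in attention]


def attention_roi_iou(
    attention: Sequence[Sequence[float]],
    roi_mask: Sequence[Sequence[int]],
    threshold: float = 0.5,
) -> float:
    """IoU between the thresholded attention map and a binary ROI mask."""
    attended = binarize(attention, threshold)
    intersection = 0
    union = 0
    for att_row, roi_row in zip(attended, roi_mask):
        for a, r in zip(att_row, roi_row):
            in_roi = bool(r)
            intersection += a and in_roi
            union += a or in_roi
    return intersection / union if union else 0.0


# Toy 3x3 attention map concentrated around the center-right pixels.
attention = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.8],
    [0.0, 0.7, 0.1],
]
roi_aligned = [[0, 0, 0], [0, 1, 1], [0, 0, 0]]    # overlaps the attended area
roi_elsewhere = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]  # disjoint from it

iou_aligned = attention_roi_iou(attention, roi_aligned)
iou_elsewhere = attention_roi_iou(attention, roi_elsewhere)
```

A low IoU, as in the second case, is the kind of signal the abstract describes as mislocalization: the model attends to a region other than the one the edit targets.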

Paper Structure

This paper contains 40 sections, 2 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: A state-of-the-art model performs well on common images (e.g., “add a missing key”) but surprisingly struggles with medical images (e.g., “add a missing tooth”).
  • Figure 2: Overview of MedEBench, a text-guided benchmark for medical image editing. (A) Data preparation includes collecting image triplets, generating ROI masks, and describing intended changes. (B) Models generate edited images from prompts and previous images. (C) The structural similarity index (SSIM) measures contextual preservation; editing accuracy and visual quality are assessed by GPT-4o, guided by the change description.
  • Figure 3: Organ distribution
  • Figure 4: Properties of the MedEBench dataset: CLIP score between the instruction and the preceding image, and the ROI mask ratio relative to the full image.
  • Figure 5: Per-organ performance comparison of seven image editing models across three metrics.
  • ...and 8 more figures