MedEBench: Diagnosing Reliability in Text-Guided Medical Image Editing

Minghao Liu; Zhitao He; Zhiyuan Fan; Qingyun Wang; Yi R. Fung

TL;DR

MedEBench introduces the first benchmark dedicated to text-guided medical image editing, addressing the lack of clinically meaningful evaluation metrics. The dataset contains 1,182 real before/after medical image pairs spanning 13 anatomical regions and 70 editing tasks, each with an editing prompt, an ROI mask, and a structured description of the intended change. The benchmark scores Context Preservation (CP), Editing Accuracy (EA), and Visual Quality (VQ) using masked SSIM and GPT-4o-based judgments, and pairs these scores with attention-grounding diagnostics that identify mislocalization and grounding failures. Across seven baselines, Gemini 2 Flash achieves the best overall performance; fine-tuning shows clear benefits in data-scarce medical domains, while in-context prompting shows limited generalization for pixel-level edits. The results underscore the need for anatomy-aware architectures and medically supervised training to improve reliability and safety in clinical image editing.

Abstract

Text-guided image editing has seen significant progress in natural image domains, but its application in medical imaging remains limited and lacks standardized evaluation frameworks. Such editing could revolutionize clinical practices by enabling personalized surgical planning, enhancing medical education, and improving patient communication. To bridge this gap, we introduce MedEBench, a robust benchmark designed to diagnose reliability in text-guided medical image editing. MedEBench consists of 1,182 clinically curated image-prompt pairs covering 70 distinct editing tasks and 13 anatomical regions. It contributes in three key areas: (1) a clinically grounded evaluation framework that measures Editing Accuracy, Context Preservation, and Visual Quality, complemented by detailed descriptions of intended edits and corresponding Region-of-Interest (ROI) masks; (2) a comprehensive comparison of seven state-of-the-art models, revealing consistent patterns of failure; and (3) a diagnostic error analysis technique that leverages attention alignment, using Intersection-over-Union (IoU) between model attention maps and ROI masks to identify mislocalization issues, where models erroneously focus on incorrect anatomical regions. MedEBench sets the stage for developing more reliable and clinically effective text-guided medical image editing tools.
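The attention-alignment diagnostic described in contribution (3) can be illustrated with a minimal sketch. The abstract states only that IoU is computed between model attention maps and ROI masks; how a continuous attention map is turned into a binary region is not specified here, so the min-max normalization, the 0.5 threshold, and the function names binarize_attention and attention_roi_iou below are all assumptions made for illustration, not the authors' procedure.

```python
from typing import List, Sequence

Grid = Sequence[Sequence[float]]


def binarize_attention(attention: Grid, threshold: float = 0.5) -> List[List[bool]]:
    """Min-max normalize an attention map to [0, 1], then threshold it.

    ASSUMPTION: both the normalization and the threshold value are
    illustrative choices; the source does not state how maps are binarized.
    """
    flat = [v for row in attention for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map has no localized focus; treat it as an empty region.
        return [[False for _ in row] for row in attention]
    return [[(v - lo) / span >= threshold for v in row] for row in attention]


def attention_roi_iou(attention: Grid, roi_mask: Sequence[Sequence[int]],
                      threshold: float = 0.5) -> float:
    """Intersection-over-Union between the thresholded attention and the ROI."""
    pred = binarize_attention(attention, threshold)
    if len(pred) != len(roi_mask) or any(
        len(p) != len(r) for p, r in zip(pred, roi_mask)
    ):
        raise ValueError("attention map and ROI mask must have the same shape")

    intersection = 0
    union = 0
    for pred_row, roi_row in zip(pred, roi_mask):
        for p, r in zip(pred_row, roi_row):
            in_roi = bool(r)
            intersection += int(p and in_roi)
            union += int(p or in_roi)
    # If both regions are empty there is nothing to align; report 0.0.
    return intersection / union if union else 0.0


# Toy 4x4 example: the ROI is the top-left 2x2 block.
roi = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Attention mostly on the ROI, with one stray hot spot in the far corner.
well_grounded = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]

# Attention concentrated on the opposite corner: a mislocalization case.
mislocalized = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.7, 0.9],
]

print(attention_roi_iou(well_grounded, roi))  # 0.8
print(attention_roi_iou(mislocalized, roi))   # 0.0
```

In this toy setting a low IoU flags an edit whose attention landed outside the annotated region, which is the mislocalization failure the abstract describes. The score depends on the chosen threshold, so any real use would need to fix that choice consistently across models.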
