CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation
Robert Cronshaw, Konstantinos Vilouras, Junyu Yan, Yuning Du, Feng Chen, Steven McDonagh, Sotirios A. Tsaftaris
TL;DR
This paper addresses the challenge of evaluating medical text-to-image generation on clinical semantics rather than visual fidelity alone. It introduces CSEval, a modular pipeline that converts each generated medical image into a textual clinical report, extracts clinical entities from that report, and computes RadGraph-F1 to measure alignment with the original prompt. Experiments show that CSEval reliably detects semantic misalignments that traditional image-focused metrics miss and agrees more closely with expert radiologist judgments. By enabling scalable, clinically meaningful evaluation, CSEval supports safer deployment of medical generative models in clinical workflows.
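The three stages described above (image to report, entity extraction, F1 over entities) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `entity_f1`, `cseval_score`, `image_to_report`, and `extract_entities` are hypothetical, and the score is a simplified set-overlap F1 over (token, label) pairs, whereas RadGraph-F1 proper relies on a trained RadGraph model and can also score relations between entities.

```python
from typing import Callable, Iterable, Set, Tuple

# Assumption: an entity is a (normalized token, label) pair,
# e.g. ("effusion", "present"). Real RadGraph output is richer.
Entity = Tuple[str, str]


def entity_f1(reference: Iterable[Entity], candidate: Iterable[Entity]) -> float:
    """Simplified entity-level F1 between two entity collections."""
    ref, cand = set(reference), set(candidate)
    overlap = len(ref & cand)
    if overlap == 0:
        # Also covers the case where either side is empty.
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cseval_score(
    prompt: str,
    image: object,
    image_to_report: Callable[[object], str],
    extract_entities: Callable[[str], Set[Entity]],
) -> float:
    """Score one generated image against its conditioning prompt.

    Both callables are placeholders: in practice they would wrap a
    report-generation model and a clinical entity extractor.
    """
    report = image_to_report(image)
    return entity_f1(extract_entities(prompt), extract_entities(report))


# Toy stand-ins so the sketch runs end to end (hypothetical vocabulary).
TOY_VOCAB = {
    "left": "anatomy",
    "right": "anatomy",
    "pleural": "anatomy",
    "effusion": "present",
}


def toy_extract(text: str) -> Set[Entity]:
    words = text.lower().replace(".", " ").split()
    return {(w, TOY_VOCAB[w]) for w in words if w in TOY_VOCAB}


def toy_report(image: object) -> str:
    # Pretend the generated image shows the finding on the wrong side.
    return "Right pleural effusion."


score = cseval_score("Left pleural effusion", None, toy_report, toy_extract)
print(round(score, 3))
```

In this toy case the prompt and the report share two of three entities, so precision and recall are both 2/3 and the score drops below 1 even though an image-realism metric would have no reason to penalize the wrong laterality.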
Abstract
Text-to-image generation is increasingly applied in medical domains for purposes such as data augmentation and education, which makes evaluating the quality and clinical reliability of the generated images essential. However, existing methods mainly assess image realism or diversity and fail to capture whether the generated images reflect the intended clinical semantics, such as anatomical location and pathology. In this study, we propose the Clinical Semantics Evaluator (CSEval), a framework that leverages language models to assess clinical semantic alignment between generated images and their conditioning prompts. Our experiments show that CSEval identifies semantic inconsistencies overlooked by other metrics and correlates with expert judgment. CSEval provides a scalable and clinically meaningful complement to existing evaluation methods, supporting the safe adoption of generative models in healthcare.
