CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation

Robert Cronshaw; Konstantinos Vilouras; Junyu Yan; Yuning Du; Feng Chen; Steven McDonagh; Sotirios A. Tsaftaris

TL;DR

This paper addresses the challenge of evaluating medical text-to-image generation beyond visual fidelity by focusing on clinical semantics. It introduces CSEval, a modular pipeline that converts generated medical images into textual clinical reports, extracts clinical entities from those reports, and computes RadGraph-F1 to measure alignment with the original prompts. The experiments show that CSEval detects semantic misalignments that traditional image-focused metrics miss and agrees more closely with expert radiologist judgments. By enabling scalable, clinically meaningful evaluation, CSEval supports safer deployment of medical generative models in clinical workflows.

Abstract

Text-to-image generation has been increasingly applied in medical domains for various purposes such as data augmentation and education. Evaluating the quality and clinical reliability of these generated images is essential. However, existing methods mainly assess image realism or diversity, while failing to capture whether the generated images reflect the intended clinical semantics, such as anatomical location and pathology. In this study, we propose the Clinical Semantics Evaluator (CSEval), a framework that leverages language models to assess clinical semantic alignment between the generated images and their conditioning prompts. Our experiments show that CSEval identifies semantic inconsistencies overlooked by other metrics and correlates with expert judgment. CSEval provides a scalable and clinically meaningful complement to existing evaluation methods, supporting the safe adoption of generative models in healthcare.

Paper Structure (17 sections, 1 equation, 2 figures, 2 tables)

Introduction
Related Work
Text-to-Image Generation in Medical Imaging
Evaluation Metrics for Synthetic Images
Clinical Evaluation Metrics for Automated Report Generation
Methodology
Main Idea
Design of CSEval
Prompt Template for Image Generation
Report Generation Module
Entity Recognition Module
Experiments
Implementation Details
Results
Conclusion
...and 2 more sections

Figures (2)

Figure 1: Our proposed framework for evaluating clinical semantics in text-to-image generation. A user-defined prompt guides the generation of synthetic medical images. Then, a pre-trained report generation model generates the findings of these synthetic images. The selected metric (RadGraph-F1 score), which measures the overlap between ground truth and synthetic entity-relation graphs in text space, quantitatively reflects how accurately the synthetic image adheres to the clinical details described in the original prompt.
Figure 2: Synthetic examples with user prompts and RadGraph-F1 scores. A: large left apical pneumothorax (0.00), B: severe cardiomegaly (0.67), C: moderate right pleural effusion (0.29), D: small left lower lobe opacification (0.20).
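
The overlap idea behind the score in Figure 1 can be illustrated with a minimal sketch. This is not the RadGraph-F1 implementation used in the paper: it assumes entities have already been extracted from both the prompt and the generated report, represents each as a hypothetical (token, label) pair, ignores relations, and computes a plain set-based F1. The function name `entity_f1`, the label strings, and the example entities are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical simplified entity: (token, label). Real RadGraph output
# also carries relations between entities, which this sketch ignores.
Entity = Tuple[str, str]


def entity_f1(prompt_entities: Iterable[Entity],
              report_entities: Iterable[Entity]) -> float:
    """Set-based F1 between prompt entities and report entities."""
    reference = set(prompt_entities)
    predicted = set(report_entities)
    if not reference and not predicted:
        return 1.0  # nothing expected, nothing found
    true_positives = len(reference & predicted)
    if true_positives == 0:
        return 0.0  # no shared entities (also covers one empty side)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Illustrative entities only; not taken from the paper's data.
prompt = [("moderate", "observation"), ("right", "anatomy"),
          ("pleural", "anatomy"), ("effusion", "observation")]
report = [("left", "anatomy"), ("pleural", "anatomy"),
          ("effusion", "observation")]

score = entity_f1(prompt, report)
print(f"entity F1 = {score:.3f}")
```

In this toy case the report recovers the pathology but places it on the wrong side and drops the severity, so two of four prompt entities match and the score falls well below 1. An image-realism metric would not register this kind of mismatch.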
