CSEval: A Framework for Evaluating Clinical Semantics in Text-to-Image Generation

Robert Cronshaw, Konstantinos Vilouras, Junyu Yan, Yuning Du, Feng Chen, Steven McDonagh, Sotirios A. Tsaftaris

TL;DR

This paper addresses the challenge of evaluating medical text-to-image generation beyond visual fidelity by focusing on clinical semantics. It introduces CSEval, a modular pipeline that converts generated medical images into textual clinical reports, extracts clinical entities, and computes RadGraph-F1 to measure alignment with the original prompts. The experiments show that CSEval detects semantic misalignments that image-focused metrics miss and that its scores correlate with expert judgment. By enabling scalable, clinically meaningful evaluation, CSEval supports safer deployment of medical generative models in clinical workflows.

Abstract

Text-to-image generation has been increasingly applied in medical domains for various purposes such as data augmentation and education. Evaluating the quality and clinical reliability of these generated images is essential. However, existing methods mainly assess image realism or diversity, while failing to capture whether the generated images reflect the intended clinical semantics, such as anatomical location and pathology. In this study, we propose the Clinical Semantics Evaluator (CSEval), a framework that leverages language models to assess clinical semantic alignment between the generated images and their conditioning prompts. Our experiments show that CSEval identifies semantic inconsistencies overlooked by other metrics and correlates with expert judgment. CSEval provides a scalable and clinically meaningful complement to existing evaluation methods, supporting the safe adoption of generative models in healthcare.

Paper Structure (17 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Our proposed framework for evaluating clinical semantics in text-to-image generation. A user-defined prompt guides the generation of synthetic medical images. Then, a pre-trained report generation model generates the findings of these synthetic images. The selected metric (RadGraph-F1 score), which measures the overlap between ground truth and synthetic entity-relation graphs in text space, quantitatively reflects how accurately the synthetic image adheres to the clinical details described in the original prompt.
  • Figure 2: Synthetic examples with user prompts and RadGraph-F1 scores. A: large left apical pneumothorax (0.00), B: severe cardiomegaly (0.67), C: moderate right pleural effusion (0.29), D: small left lower lobe opacification (0.20).
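
The scoring step in Figure 1 compares entities extracted from the prompt with entities extracted from the report generated for the synthetic image. The sketch below illustrates only the overlap idea, as a set-based F1 over (token, label) pairs. It is not the RadGraph-F1 implementation used in the paper: the function name `entity_f1`, the label strings, and the example entity sets are all hypothetical, relations are ignored, and the resulting value is not meant to reproduce any score in Figure 2.

```python
from typing import Iterable, Tuple

# Hypothetical entity representation: (token, label). The label strings
# used below are illustrative and are not the RadGraph label set.
Entity = Tuple[str, str]


def entity_f1(reference: Iterable[Entity], candidate: Iterable[Entity]) -> float:
    """Set-overlap F1 between two collections of entities.

    Assumption of this sketch: when there is no overlap (including the case
    where either side is empty), the score is 0.0.
    """
    ref, cand = set(reference), set(candidate)
    true_positives = len(ref & cand)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)  # share of report entities found in the prompt
    recall = true_positives / len(ref)      # share of prompt entities found in the report
    return 2 * precision * recall / (precision + recall)


# Entities one might extract from the prompt "moderate right pleural effusion".
prompt_entities = {
    ("moderate", "observation"),
    ("right", "anatomy"),
    ("pleural", "anatomy"),
    ("effusion", "observation"),
}

# Entities one might extract from a generated report that describes an
# effusion on the wrong side and omits the severity.
report_entities = {
    ("left", "anatomy"),
    ("pleural", "anatomy"),
    ("effusion", "observation"),
}

score = entity_f1(prompt_entities, report_entities)
print(f"{score:.3f}")
```

In this toy case two of the three report entities match the prompt (precision 2/3) and two of the four prompt entities are recovered (recall 1/2), so the score is 4/7. A laterality error or a missing severity term lowers the score even when the image itself looks realistic, which is the kind of misalignment the abstract says image-focused metrics overlook.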