CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

Mohammed Baharoon; Thibault Heintz; Siavash Raissi; Mahmoud Alabbad; Mona Alhammad; Hassan AlOmaish; Sung Eun Kim; Oishi Banerjee; Pranav Rajpurkar

TL;DR

CRIMSON is a clinically grounded, severity-aware metric for evaluating generated chest X-ray reports. The metric, the RadJudge and RadPref evaluation benchmarks, and a fine-tuned MedGemma model are released at https://github.com/rajpurkarlab/CRIMSON to enable reproducible evaluation of report generation.

Abstract

We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
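The Kendall's tau and Pearson's r values reported in the abstract measure how closely a metric tracks radiologist-annotated error counts. The sketch below shows, under stated assumptions, how such correlations can be computed from paired values. It is not the authors' evaluation code: the function names, the tau-b tie correction, the `metric_penalty` orientation (higher means worse, so it should rise with error counts), and all numbers are hypothetical illustrations and not data from ReXVal.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance over the product of standard deviations.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs with a tie correction.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    concordant = discordant = tied_x = tied_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1  # pair tied on the first variable
        if dy == 0:
            tied_y += 1  # pair tied on the second variable
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / sqrt((total - tied_x) * (total - tied_y))


# Hypothetical toy data: radiologist error counts and a metric's penalty
# (higher = worse) for five candidate reports. Not taken from the paper.
error_counts = [0, 1, 1, 3, 4]
metric_penalty = [0.0, 0.2, 0.1, 0.6, 0.9]

print("Kendall's tau-b:", round(kendall_tau_b(error_counts, metric_penalty), 3))
print("Pearson's r:", round(pearson_r(error_counts, metric_penalty), 3))
```

Kendall's tau depends only on the ordering of the paired values, while Pearson's r is sensitive to their linear relationship, which is one reason to report both.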

Paper Structure (12 sections, 3 equations, 4 figures, 1 table)

Introduction
Related Work
CRIMSON: Clinically-Grounded Report Evaluation
Finding Extraction and Clinical Significance Assignment
Error Taxonomy and Classification
Severity-Aware Scoring
Results
Correlation with Radiologist-Annotated Significant Errors
Radiologist-Guided Clinical Judgment Test
Radiologist Preference Alignment
MedGemma Fine-tuning and Analysis
Discussion

Figures (4)

Figure 1: Representative RadJudge cases illustrating core design principles of CRIMSON. Top: Patient context sensitivity. The clinical impact of an omission (e.g., aortic atherosclerosis) varies by age and indication, and CRIMSON adjusts severity accordingly. Middle: Normal finding handling. CRIMSON does not reward mentioning normal findings, preventing score inflation. Bottom: Clinical significance weighting. Errors are weighted by consequence, prioritizing clinically important findings. In each case, CRIMSON aligns with radiologist expectations, whereas prior metrics fail.
Figure 2: RadJudge results. For each case, a metric passes if its relative ranking of multiple candidate reports agrees with the expected ordering established by agreement among three attending cardiothoracic radiologists. Each category contains three cases; entries are cases passed (out of 3), with totals out of 30.
Figure 3: Radiologist Preference Alignment (RadPref). Correlation between differences in metric score and differences in radiologist rating across 100 pairwise cases. Each point corresponds to a case comparing two candidate reports for the same reference report.
Figure 4: MedGemmaCRIMSON vs GPT-5.2. A) Mean absolute error across false findings, missing findings, and attribute errors per radiologist. B) Severity categorization confusion matrices between three radiologists and CRIMSON, computed only on matched errors (i.e., findings for which both the radiologist and CRIMSON identified an error in the same category). Titles show the percentage of cases for which the radiologist and CRIMSON agree on error category. Color intensity represents the within-row percentage.
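Two of the procedures described in the captions above can be sketched in a few lines: the RadJudge pass check from Figure 2 (does a metric's ranking of candidate reports match the expected ordering?) and the within-row percentages used for the confusion matrices in Figure 4. This is a minimal illustration and not the authors' implementation. The names `passes_case` and `row_percentages`, the requirement that scores be strictly decreasing along the expected ordering, and all numbers are assumptions made for the example.

```python
def passes_case(scores, expected_order):
    """Return True if metric scores strictly decrease along the expected ordering.

    scores: dict mapping a candidate report ID to its metric score (higher = better).
    expected_order: report IDs listed from best to worst.
    Assumption: a tie between adjacent candidates counts as a failure.
    """
    ordered = list(expected_order)
    return all(scores[a] > scores[b] for a, b in zip(ordered, ordered[1:]))


def row_percentages(matrix):
    """Convert confusion-matrix counts to within-row percentages.

    Each row is divided by its own total; an all-zero row stays all zeros.
    """
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * count / total if total else 0.0 for count in row])
    return result


# Hypothetical scores for three candidate reports in one case.
case_scores = {"A": 0.9, "B": 0.4, "C": -0.2}
print(passes_case(case_scores, ["A", "B", "C"]))

# Hypothetical 2x2 severity confusion counts (rows: radiologist, columns: metric).
print(row_percentages([[8, 2], [1, 9]]))
```

Normalizing within rows makes each row read as the distribution of the metric's severity labels given the radiologist's label, so rows with few matched errors stay comparable to rows with many.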
