CRG Score: A Distribution-Aware Clinical Metric for Radiology Report Generation

Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Bernhard Kainz, Bjoern Menze

TL;DR

The paper addresses the challenge of evaluating long-context radiology report generation, where traditional NLG metrics fail to capture clinical correctness and fixed LLM evaluators lack generalizability. It introduces the CRG Score, a distribution-aware clinical accuracy metric that concentrates on clinically relevant abnormalities, ignores irrelevant true negatives, balances penalties by label distribution, and supports both binary and structured labels; it is model-agnostic and pairs with any LLM for feature extraction. CRG is computed via the normalized formula $CRG = \frac{S_{\max}}{2 S_{\max} - s}$ with $s = TP \cdot w_{TP} - FN \cdot w_{FN} - FP \cdot w_{FP}$ and weights $w_{TP} = w_{FN} = \frac{T - A}{2A}$, $w_{FP} = 1$, where $S_{\max} = A \cdot w_{TP}$. The authors demonstrate CRG on the CT-RATE validation set across multiple CT report generation models, illustrating its clinical alignment and its potential as a reinforcement learning reward to balance fluency and accuracy. Overall, CRG offers a principled, distribution-aware approach to evaluating radiology reports that remains robust under class imbalance and extends to richer label structures in future work.
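The formula can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The summary does not define $T$ and $A$, so the code assumes that $T$ is the total number of labels in the evaluation set and $A$ is the number of positive (abnormal) labels among them. The function name `crg_score` and its parameter names are hypothetical.

```python
def crg_score(tp: int, fn: int, fp: int, total: int, abnormal: int) -> float:
    """Compute CRG = S_max / (2 * S_max - s) from confusion counts.

    Assumption (not stated in the summary): `total` plays the role of T,
    the total number of labels, and `abnormal` plays the role of A, the
    number of positive labels in the reference set.
    """
    if abnormal <= 0 or total <= abnormal:
        # With A = 0 or T = A the weights collapse and S_max is 0 or undefined.
        raise ValueError("need 0 < abnormal < total")

    w_tp = w_fn = (total - abnormal) / (2 * abnormal)  # w_TP = w_FN = (T - A) / (2A)
    w_fp = 1.0                                         # w_FP = 1

    s = tp * w_tp - fn * w_fn - fp * w_fp              # raw score; true negatives are ignored
    s_max = abnormal * w_tp                            # best achievable raw score
    return s_max / (2 * s_max - s)


if __name__ == "__main__":
    # Toy numbers: 100 labels, 20 of them abnormal.
    print(crg_score(tp=20, fn=0, fp=0, total=100, abnormal=20))   # perfect prediction -> 1.0
    print(crg_score(tp=0, fn=20, fp=0, total=100, abnormal=20))   # predict nothing abnormal
    print(crg_score(tp=20, fn=0, fp=80, total=100, abnormal=20))  # predict everything abnormal
```

Under these assumptions, the two trivial strategies in the toy example (predicting no abnormalities and predicting all of them) receive the same score, which is well below that of a perfect prediction. This matches the stated goal of not favoring trivial predictions under class imbalance.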

Abstract

Evaluating long-context radiology report generation is challenging. NLG metrics fail to capture clinical correctness, while LLM-based metrics often lack generalizability. Clinical accuracy metrics are more relevant but are sensitive to class imbalance, frequently favoring trivial predictions. We propose the CRG Score, a distribution-aware and adaptable metric that evaluates only clinically relevant abnormalities explicitly described in reference reports. CRG supports both binary and structured labels (e.g., type, location) and can be paired with any LLM for feature extraction. By balancing penalties based on label distribution, it enables fairer, more robust evaluation and serves as a clinically aligned reward function.

Paper Structure

This paper contains 5 sections, 4 equations, 2 tables.