Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

Ran Zhang, Yucong Lin, Zhaoli Su, Bowen Liu, Danni Ai, Tianyu Fu, Deqiang Xiao, Jingfan Fan, Yuanyuan Wang, Mingwei Gao, Yuwan Hu, Shuya Gao, Jingtao Li, Jian Yang, Hong Song, Hongliang Sun

Abstract

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.

Paper Structure (17 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Human–LLM collaborative framework for multi-label finding extraction from chest X-ray reports. (a) Exploratory extraction and clustering of disease-related entities from 3,000 chest X-ray reports to inform the definition of standardized finding labels. (b) Establishment of the diagnostic reference standard through independent binary annotation of reports by six board-certified thoracic radiologists. (c) Illustration of the Human–LLM collaborative workflow using pneumothorax as an example, showing iterative clinician-guided prompt refinement based on error analysis to improve finding-level extraction performance toward higher agreement with the radiologist-derived reference standard.
  • Figure 2: Flowchart of dataset construction and cohort allocation. Reports from MIMIC-CXR were screened and divided into three non-overlapping MIMIC-CXR-EN cohorts: a 3,000-report cohort for taxonomy construction, a 300-report development cohort for prompt optimization and radiologist reference-standard-based evaluation, and an independent 3,000-report test cohort for downstream benchmarking of report generation models. An additional 150-report ChestX-CN validation cohort was used for external validation.
  • Figure 3: Micro-averaged performance of radiology report generation models on the MIMIC-CXR-EN test cohort. Micro-averaged precision, recall and F1 score were calculated across all finding labels, with each label weighted according to its prevalence in the dataset. Model predictions were evaluated using the optimized Qwen3-14B labeler.
  • Figure 4: Macro-averaged performance of radiology report generation models across finding categories. Macro-averaged precision, recall and F1 score were calculated by treating all finding labels equally. This approach emphasizes model performance on low-prevalence and rare findings.
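
The difference between the micro-averaged metrics of Figure 3 and the macro-averaged metrics of Figure 4 can be illustrated with a minimal sketch. It is not the paper's implementation: the toy reports, the label names other than pneumothorax, the function names, and the convention of scoring an undefined ratio as 0.0 are all assumptions made for illustration. Each report is represented as the set of finding labels marked present.

```python
from typing import Dict, List, Set, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from counts; undefined ratios become 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def label_counts(reference: List[Set[str]], predicted: List[Set[str]],
                 labels: List[str]) -> Dict[str, Counts]:
    """Count TP/FP/FN per finding label over paired reference and predicted label sets."""
    counts: Dict[str, Counts] = {}
    for label in labels:
        tp = fp = fn = 0
        for ref, pred in zip(reference, predicted):
            if label in ref and label in pred:
                tp += 1
            elif label in pred:
                fp += 1
            elif label in ref:
                fn += 1
        counts[label] = (tp, fp, fn)
    return counts


def micro_average(counts: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool counts over all labels first, so frequent labels dominate the result."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Score each label separately, then take the unweighted mean across labels."""
    per_label = [prf(*c) for c in counts.values()]
    n = len(per_label)
    return (sum(p for p, _, _ in per_label) / n,
            sum(r for _, r, _ in per_label) / n,
            sum(f for _, _, f in per_label) / n)


# Toy data (hypothetical): a common finding labeled correctly, a rare one missed.
labels = ["pleural effusion", "pneumothorax"]
reference = [{"pleural effusion"}, {"pleural effusion"}, {"pleural effusion"},
             {"pleural effusion", "pneumothorax"}, set()]
predicted = [{"pleural effusion"}, {"pleural effusion"}, {"pleural effusion"},
             {"pleural effusion"}, set()]

counts = label_counts(reference, predicted, labels)
micro = micro_average(counts)
macro = macro_average(counts)
print(f"micro P/R/F1: {micro[0]:.3f} {micro[1]:.3f} {micro[2]:.3f}")
print(f"macro P/R/F1: {macro[0]:.3f} {macro[1]:.3f} {macro[2]:.3f}")
```

In this toy case the single missed pneumothorax lowers micro-averaged F1 only to about 0.889, while macro-averaged F1 falls to 0.5, which is why macro averaging is the more sensitive view of performance on low-prevalence findings.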