ReXrank: A Public Leaderboard for AI-Powered Radiology Report Generation

Xiaoman Zhang, Hong-Yu Zhou, Xiaoli Yang, Oishi Banerjee, Julián N. Acosta, Josh Miller, Ouwen Huang, Pranav Rajpurkar

TL;DR

ReXrank enables meaningful comparisons of model performance, offers insights into model robustness across diverse clinical settings, and sets the stage for comprehensive evaluation of automated reporting across the full spectrum of medical imaging.

Abstract

AI-driven models have demonstrated significant potential in automating radiology report generation for chest X-rays. However, there is no standardized benchmark for objectively evaluating their performance. To address this, we present ReXrank (https://rexrank.ai), a public leaderboard and challenge for assessing AI-powered radiology report generation. Our framework incorporates ReXGradient, the largest test dataset, consisting of 10,000 studies, and three public datasets (MIMIC-CXR, IU X-ray, CheXpert Plus) for report generation assessment. ReXrank employs 8 evaluation metrics and separately assesses models that generate only findings sections and models that provide both findings and impressions sections. By providing this standardized evaluation framework, ReXrank enables meaningful comparisons of model performance and offers crucial insights into model robustness across diverse clinical settings. Beyond its current focus on chest X-rays, ReXrank's framework sets the stage for comprehensive evaluation of automated reporting across the full spectrum of medical imaging.

Paper Structure

This paper contains 9 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An illustration of ReXrank, a public leaderboard and challenge for AI-powered radiology report generation from chest X-ray images. ReXrank supports model submissions and evaluates them on both public datasets and a large-scale private dataset, providing comprehensive rankings of all submitted models.
  • Figure 2: Comprehensive performance evaluation and ranking of report generation models based on the 1/RadCliQ-v1 metric averaged across four distinct datasets: ReXGradient, MIMIC-CXR, IU X-ray, and CheXpert Plus. MedVersa (highlighted in purple) demonstrates consistently superior performance, achieving significantly higher scores than other models, including GPT4V (highlighted in yellow).
  • Figure 3: Distribution of evaluation metrics across the four datasets: ReXGradient, MIMIC-CXR, IU X-ray, and CheXpert Plus. Box plots show the variation in model performance for each metric. For consistency in visualization, we plot the reciprocals (1/x) of FineRadScore and RadCliQ-v1, so higher values indicate better performance across all metrics.
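
The ranking described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that "average 1/RadCliQ-v1" means the mean of the per-dataset reciprocals (not the reciprocal of the mean), and that RadCliQ-v1 values are positive, with lower values being better. The function names, the model names, and the scores below are hypothetical and are not taken from the paper or the leaderboard.

```python
from statistics import mean

# Dataset names as listed in the figure captions.
DATASETS = ("ReXGradient", "MIMIC-CXR", "IU X-ray", "CheXpert Plus")


def mean_inverse_radcliq(scores_by_dataset):
    """Mean of 1/RadCliQ-v1 over the four datasets (higher is better).

    Assumption: every score is strictly positive, so the reciprocal
    is defined and preserves the "lower RadCliQ-v1 is better" ordering.
    """
    reciprocals = []
    for name in DATASETS:
        score = scores_by_dataset[name]
        if score <= 0:
            raise ValueError(f"non-positive RadCliQ-v1 for {name}: {score}")
        reciprocals.append(1.0 / score)
    return mean(reciprocals)


def rank_models(radcliq):
    """Return (model, averaged score) pairs sorted from best to worst."""
    averaged = {model: mean_inverse_radcliq(s) for model, s in radcliq.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical RadCliQ-v1 values for illustration only; not paper results.
example_scores = {
    "model_a": {"ReXGradient": 1.0, "MIMIC-CXR": 2.0,
                "IU X-ray": 1.0, "CheXpert Plus": 2.0},
    "model_b": {"ReXGradient": 2.0, "MIMIC-CXR": 4.0,
                "IU X-ray": 2.0, "CheXpert Plus": 4.0},
}

ranking = rank_models(example_scores)
for model, score in ranking:
    print(f"{model}: {score:.3f}")
```

Taking the reciprocal before averaging keeps the plotted quantity on a "higher is better" scale, matching the convention the Figure 3 caption describes for FineRadScore and RadCliQ-v1.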