Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Hyuk Gi Hong, Jung-Oh Lee, Hangyul Yoon, Eun Woo Doe, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi

TL;DR

This work addresses the need for temporally coherent, fine-grained evaluation of radiology reports by introducing Lunguage, a benchmark with 1,473 single chest X-ray reports and 80 longitudinal reports annotated at the entity–relation level. It proposes a two-stage, schema-guided LLM framework that transforms free text into structured representations, and a new metric, LunguageScore, that jointly measures semantic, temporal, and structural fidelity across patient timelines. The metric uses formulations such as $\mathrm{MatchScore}(f^{\mathrm{pred}}, f^{\mathrm{gold}}) = \mathrm{Semantic} \cdot (\mathrm{Temporal}\ \text{if}\ T > 1) \cdot \mathrm{Structural}$, with semantic embeddings from clinical BERT models. Empirical results show high agreement with human annotations (entity–relation F1 ≈ 0.94, full triplets ≈ 0.86) and a strong correlation between LunguageScore and radiologist judgments on ReXVal; the metric also detects longitudinal coherence weaknesses in generation models. The framework enables clinically meaningful, timeline-aware evaluation and highlights the potential for integrating structured radiology outputs with broader EHR data to improve longitudinal diagnostic reasoning.
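The MatchScore product quoted in the TL;DR can be illustrated with a minimal Python sketch. It assumes the three component similarities are already computed and lie in [0, 1]; how the paper derives each one (for example, from clinical BERT embeddings) is not shown here. The function name `match_score` and its parameters are hypothetical, chosen only for illustration.

```python
def match_score(semantic: float, temporal: float, structural: float,
                num_timepoints: int) -> float:
    """Combine component similarities multiplicatively.

    Hypothetical helper: the temporal factor is applied only when the
    patient timeline has more than one study (T > 1), mirroring the
    "(Temporal if T > 1)" term in the TL;DR formula.
    """
    for name, value in (("semantic", semantic),
                        ("temporal", temporal),
                        ("structural", structural)):
        # Assumption: each component is a similarity in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    score = semantic * structural
    if num_timepoints > 1:
        score *= temporal
    return score


# Single-report setting (T = 1): the temporal factor is ignored.
single = match_score(0.9, 0.5, 0.8, num_timepoints=1)

# Longitudinal setting (T = 3): the temporal factor scales the score.
sequential = match_score(0.9, 0.5, 0.8, num_timepoints=3)
```

Because the components are multiplied, a low score on any one dimension pulls the whole match down, which is one plausible reading of why the metric can expose temporal inconsistencies that a purely semantic comparison would miss.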

Abstract

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage

Paper Structure

This paper contains 58 sections, 12 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Schema for Single and Sequential Report Structuring. The figure shows two reports from the same patient at day 10 and day 90. For the single report schema (within each report), gray solid lines connect entities to attributes, while pink and blue solid lines represent inter-entity reasoning relations (Associate, Evidence). For the sequential schema (across reports), black solid lines denote entities in the same EntityGroup (same clinical finding over time) and TemporalGroup (same diagnostic episodes), while black dashed lines show entities in the same EntityGroup but different TemporalGroups (different diagnostic episodes).
  • Figure A.1: Distribution of the number of imaging studies per patient in Lunguage. Skyblue bars indicate the number of patients for each trajectory length (i.e., number of chest X-ray studies), reflecting the single-report annotation coverage. Salmon bars represent the subset of patients whose reports are also annotated at the longitudinal level. Values above the bars show the number of patients per group (n =), and for salmon bars, the number of patients with sequential annotations. The legend summarizes the total number of patients and reports included at each annotation level.
  • Figure A.2: Annotation interface used during gold dataset construction. Annotators reviewed GPT-4-generated triplets per report section and refined the entity–relation structure to ensure schema correctness and contextual validity.
  • Figure B.1: Overview of our end-to-end pipeline. We begin with gold-standard structured reports (Lunguage) created by radiologists. Candidate free-text reports are generated by a report model and structured via our two-stage framework: (1) schema-aligned extraction (Framework (Single)), and (2) longitudinal grouping and normalization (Framework (Sequential)). Candidate and gold outputs are aligned by entity and temporal groups, and evaluated using LUNGUAGESCORE across semantic, temporal, and structural dimensions. Std. timepoint denotes the acquisition date of each chest X-ray study.
  • Figure B.2: Prompt template used for single-report structuring of chest X-ray findings. The model receives section-wise input sentences along with vocabulary-based candidate spans and is instructed to extract relations and attributes.
  • ...and 6 more figures
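
The sequential schema described in the Figure 1 caption can be sketched as a small data structure. This is an illustrative sketch only: the class `Finding`, its field names, and the helper `link_type` are hypothetical and are not taken from the Lunguage codebase. It encodes the caption's rule that two entities in the same EntityGroup are joined by a solid line when they share a TemporalGroup and by a dashed line when they do not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """One entity mention in one report (hypothetical structure)."""
    name: str            # entity text as it appears in the report
    study_day: int       # day of the study on the patient timeline
    entity_group: str    # same clinical finding over time
    temporal_group: int  # same diagnostic episode


def link_type(a: Finding, b: Finding) -> Optional[str]:
    """Return the Figure 1 line style connecting two entities, if any."""
    if a.entity_group != b.entity_group:
        return None  # different clinical findings are not linked
    if a.temporal_group == b.temporal_group:
        return "solid"   # same finding, same diagnostic episode
    return "dashed"      # same finding, different diagnostic episodes


# Made-up example values for two reports at day 10 and day 90.
day10 = Finding("opacity", study_day=10, entity_group="opacity", temporal_group=1)
day90_same = Finding("opacity", study_day=90, entity_group="opacity", temporal_group=1)
day90_new = Finding("opacity", study_day=90, entity_group="opacity", temporal_group=2)
other = Finding("effusion", study_day=90, entity_group="effusion", temporal_group=1)

links = [link_type(day10, day90_same),
         link_type(day10, day90_new),
         link_type(day10, other)]
```

Keeping the two group labels as separate fields makes the distinction between "same finding" and "same episode" explicit, which is the distinction the caption draws between solid and dashed lines.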