MAIRA-2: Grounded Radiology Report Generation

Shruthi Bannur; Kenza Bouzid; Daniel C. Castro; Anton Schwaighofer; Anja Thieme; Sam Bond-Taylor; Maximilian Ilse; Fernando Pérez-García; Valentina Salvatelli; Harshita Sharma; Felix Meissen; Mercy Ranjit; Shaury Srivastav; Julia Gong; Noel C. F. Codella; Fabian Falck; Ozan Oktay; Matthew P. Lungren; Maria Teodora Wetscherek; Javier Alvarez-Valle; Stephanie L. Hyland

TL;DR

MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding, achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.

Abstract

Radiology reporting is a complex task requiring detailed medical image understanding and precise language generation, for which generative multimodal models offer a promising solution. However, to impact clinical practice, models must achieve a high level of both verifiable performance and utility. We augment the utility of automated report generation by incorporating localisation of individual findings on the image - a task we call grounded report generation - and enhance performance by incorporating realistic reporting context as inputs. We design a novel evaluation framework (RadFact) leveraging the logical inference capabilities of large language models (LLMs) to quantify report correctness and completeness at the level of individual sentences, while supporting the new task of grounded reporting. We develop MAIRA-2, a large radiology-specific multimodal model designed to generate chest X-ray reports with and without grounding. MAIRA-2 achieves state of the art on existing report generation benchmarks and establishes the novel task of grounded report generation.

Paper Structure (47 sections, 8 figures, 18 tables)

Extended background and related work
Why is grounded reporting a useful task?
Why do we expect additional inputs to help?
Indication section:
Prior studies:
Lateral view:
Comparison section:
Technique section:
Extended methods
Datasets used to train and evaluate MAIRA-2
MIMIC-CXR [johnson2019mimic-cxr-dataset]
PadChest [bustos2020padchest]
USMix
IU-Xray [demner2016preparing]
Additional MAIRA-2 model and training details
...and 32 more sections

Figures (8)

Figure 1: Grounded report generation with MAIRA-2. (Panel A) An illustrative example of the grounded reporting task. A grounded report is a list of sentences potentially linked to spatial annotations (bounding boxes, in this work). Normal anatomy or non-findings, as well as non-localisable observations, do not require spatial annotations. To generate a grounded report, the model can be presented with all or some of the following: the current study's frontal and lateral X-ray images; indication, technique, and comparison; prior study's frontal image and report; along with a task-specific instruction. The Indication provides clinical context on the patient and influences interpretation and reporting. The Technique describes acquired views and sometimes patient positioning (e.g. supine, lateral), while Comparison indicates whether the radiologist consulted prior studies. This example does not have a prior study, so the model receives no prior frontal image or prior report. (Panel B) The MAIRA-2 model ingests interleaved text and images, using a frozen vision encoder (Rad-DINO-MAIRA-2) and training an adapter and an autoregressive language model. Each 518×518 image is processed into patches of size 14×14 and encoded by Rad-DINO-MAIRA-2 into a sequence of 1369 visual tokens. We do not use the $\langle$CLS$\rangle$ token. (Panel C) We equip the language model with coordinate tokens enabling it to describe locations on a grid over the image. Bounding boxes are represented using the top-left and bottom-right coordinates of the box. Each grounded finding is then a single sentence followed by one or more boxes, as illustrated. A non-grounded finding is simply described by a single sentence.
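The token count in Panel B follows directly from the patch arithmetic (518 / 14 = 37 patches per side, 37 × 37 = 1369), and the box encoding in Panel C can be illustrated with a small quantisation routine. The following is a minimal sketch, not the authors' implementation: the grid size of 100 bins and the `<x..>` / `<y..>` token spellings are assumptions made for illustration, since the caption does not specify them.

```python
def num_patch_tokens(image_size: int = 518, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image, excluding any CLS token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size  # 518 // 14 = 37
    return per_side * per_side


def box_to_tokens(box, image_size: int = 518, grid_size: int = 100):
    """Quantise a pixel box (x1, y1, x2, y2) into four coordinate tokens.

    The box is given as top-left and bottom-right corners, as in the caption.
    grid_size and the token format are hypothetical choices for this sketch.
    """
    def to_bin(value):
        value = min(max(value, 0), image_size)       # clamp to the image
        index = int(value * grid_size // image_size)  # map pixel to grid bin
        return min(index, grid_size - 1)              # keep the far edge in range

    x1, y1, x2, y2 = box
    return [f"<x{to_bin(x1)}>", f"<y{to_bin(y1)}>",
            f"<x{to_bin(x2)}>", f"<y{to_bin(y2)}>"]


print(num_patch_tokens())  # 1369
```

A grounded sentence would then be serialised as its text followed by one such group of four tokens per box, while a non-grounded sentence carries no coordinate tokens.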
Figure 2: Illustration of RadFact. The proposed suite of RadFact metrics enables evaluating both text reports and grounding annotations. It is based on logical inference, using an LLM with task-specific prompting to classify hypotheses as entailed or not, given premises. The generated report is evaluated against a ground-truth report to compute precision metrics (top left), and conversely for recall metrics (top right). Detailed panel (bottom) shows a single direction of evaluation, taking the model generations as logical hypotheses and the original report as premises. Here, logical precision measures the fraction of generated sentences that are entailed by sentences from the original report. Grounding precision is the fraction of logically entailed, grounded sentences whose spatial annotations are also entailed. Spatial precision is the fraction of all grounded sentences whose spatial annotations are also entailed, hence it is upper-bounded by grounding precision. Here, spatial annotations of a sentence are one or more boxes (see sentence B). Spatial entailment requires that at least 50% of the pixels associated with the sentence fall into the union of matched evidence boxes. In the above, sentence B's evidence comes from premises 4 and 5, hence its boxes are compared with the boxes from 4 and 5.
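The precision-side definitions above can be made concrete with a short sketch. It is illustrative only and is not the RadFact implementation: it assumes the LLM entailment step has already run, so each generated sentence arrives with an `entailed` flag and the boxes of its matched evidence sentences. The `Hypothesis` class, the function names, and the choice to return `None` for an empty denominator are all hypothetical. Boxes are integer pixel rectangles `(x1, y1, x2, y2)` with exclusive lower-right corners.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One generated sentence after the (assumed, precomputed) LLM entailment step."""
    entailed: bool                                       # logically entailed by the reference report?
    boxes: list = field(default_factory=list)            # boxes attached to this sentence
    evidence_boxes: list = field(default_factory=list)   # boxes of the matched premises


def pixels(boxes):
    """Set of pixels covered by the union of the given boxes."""
    return {(x, y)
            for (x1, y1, x2, y2) in boxes
            for x in range(x1, x2)
            for y in range(y1, y2)}


def spatially_entailed(h: Hypothesis, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the sentence's pixels lie in the evidence union."""
    own = pixels(h.boxes)
    if not own:
        return False
    return len(own & pixels(h.evidence_boxes)) / len(own) >= threshold


def radfact_precision(hypotheses):
    """Logical, grounding, and spatial precision for one generated report."""
    grounded = [h for h in hypotheses if h.boxes]
    entailed_grounded = [h for h in grounded if h.entailed]
    spatial_ok = [h for h in entailed_grounded if spatially_entailed(h)]

    def ratio(num, den):
        return num / den if den else None

    return {
        "logical": ratio(sum(h.entailed for h in hypotheses), len(hypotheses)),
        "grounding": ratio(len(spatial_ok), len(entailed_grounded)),
        "spatial": ratio(len(spatial_ok), len(grounded)),
    }
```

Because the numerator is shared and the set of entailed grounded sentences is a subset of all grounded sentences, spatial precision can never exceed grounding precision in this sketch. The recall metrics would use the same computation with the roles of the generated and reference reports swapped.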
Figure 3: MAIRA-2 can generate grounded reports, and establishes a new state of the art in non-grounded report generation. (Panel A) Performance on the grounded reporting task on GR-Bench (USMix) and PadChest-GR. MAIRA-2 achieves RadFact logical precision above 50% with high grounding precision (68.8% and 80.2%, respectively) and moderate spatial precision (33.5% and 37.1%). (Panel B) On MIMIC-CXR we compare to the closest prior state of the art, restricted to models evaluated for Findings generation, namely Med-PaLM M [tu2024medpalmm] (with a different test set, counting the laterals as individual samples), LLaVA-Rad [chaves2024towards], MedVersa [zhou2024generalist], and MAIRA-1 [hyland2024maira1]. Since many of these models are not publicly available, we present their evaluation results as originally reported, for available metrics. For MAIRA-1, we obtained the model generations on the MIMIC-CXR test set in order to run RadFact. There is no prior work evaluating on PadChest, hence we report MAIRA-2 performance to establish a benchmark. IU-Xray is used as a fully held-out evaluation dataset. High RadFact logical precision and recall on IU-Xray demonstrate that MAIRA-2 generalises well to an unseen dataset. We report median and 95% confidence intervals based on 500 bootstrap samples. '$\downarrow$' indicates that lower is better. CheXpert F$_1$ metrics are computed based on CheXbert labeller outputs. RadFact uses RadFact-Llama3.
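The "median and 95% confidence intervals based on 500 bootstrap samples" can be obtained along the following lines. This is a minimal sketch under stated assumptions: the caption does not say how the interval is constructed, so a percentile bootstrap over per-study scores is assumed, and the function name, the nearest-rank percentile rule, and the fixed seed are illustrative choices.

```python
import random
import statistics


def bootstrap_median_ci(scores, metric=statistics.mean, n_boot=500,
                        alpha=0.05, seed=0):
    """Median and percentile confidence interval of a metric under resampling.

    scores: per-study metric values for the test set.
    metric: aggregates one resampled set of scores into a single number.
    Returns (median, lower, upper) over the n_boot bootstrap replicates.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    replicates = sorted(
        metric([scores[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )

    def percentile(q):
        # Nearest-rank percentile on the sorted replicates (an assumed convention).
        return replicates[round(q * (n_boot - 1))]

    return statistics.median(replicates), percentile(alpha / 2), percentile(1 - alpha / 2)
```

For example, `bootstrap_median_ci(per_study_scores)` would return the three numbers plotted for one bar and its error bars, where `per_study_scores` is a hypothetical list holding one metric value per test study.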
Figure 4: In-depth qualitative review of MAIRA-2's performance on twenty randomly selected examples from GR-Bench. A thoracic radiologist was asked to assess every generated sentence and to accept it as-is, edit it, or delete it, and to add any missing sentences. (Panel A) Of the 135 generated sentences, the majority (90%, n=123) did not require any edits, amounting to six (30%) fully correct generated reports. Few edits related to clinically significant findings, with the majority of studies (90%, n=18) having errors of no or minor clinical implications. (Panel B) Of the 25 errors (edits to sentences or additions), the majority (60%, n=15) were omissions where MAIRA-2 failed to generate a finding. (Panel C) Most errors were deemed to have minor or no clinical implications (92%, n=23). The full set of errors, with explanations, is provided in \ref{tab:corrections_add}, \ref{tab:corrections_del}, and \ref{tab:corrections_edits}.
Figure 5: Impact of dropping model inputs during both training and inference ('Train:') and during inference only ('Infer:') on MIMIC Findings generation. (Panel A) Impact of dropping the prior study and the comparison section for the 88.6% test subset that have a Prior (n=2,181). The percentage of comparison mentions ('% Comparison mentions') is estimated using Llama3-70B. The dashed line indicates the frequency of comparison mentions (91.84%) in the ground-truth reports in the same data subset, for reference. (Panel B) Impact of dropping the lateral view and the technique section for the 30.6% test subset that have a Lateral view (n=1,116). The dashed line indicates the frequency of lateral mentions (35.57%) in the ground-truth reports in the same data subset, for reference. We report median and 95% confidence intervals based on 500 bootstrap samples. '$\downarrow$' indicates that lower is better. Tabular representations of these results are available in \ref{tab:fg_prior_and_comparison_ablation} and \ref{tab:fg_lateral_and_technique_ablation}, respectively. Note that for these ablations, we used a slightly earlier variant of MAIRA-2 trained without PadChest-GR.
...and 3 more figures
