Image-aware Evaluation of Generated Medical Reports

Gefen Dawidowicz, Elad Hirsch, Ayellet Tal

TL;DR

VLScore is a novel evaluation metric for automatic medical report generation from X-ray images. It measures the similarity between radiology reports while taking the corresponding image into account, overcoming the limitations of existing evaluation methods.

Abstract

We propose VLScore, a novel evaluation metric for automatic medical report generation from X-ray images. It aims to overcome the limitations of existing evaluation methods, which either focus solely on textual similarity and ignore clinical aspects, or concentrate on a single clinical aspect, the pathology, and neglect all other factors. The key idea of our metric is to measure the similarity between radiology reports while considering the corresponding image. We demonstrate the benefit of our metric through evaluation on a dataset in which radiologists marked errors in pairs of reports, showing notable alignment with radiologists' judgments. In addition, we provide a new dataset for evaluating metrics. This dataset includes well-designed perturbations that distinguish between significant modifications (e.g., removal of a diagnosis) and insignificant ones. It highlights the weaknesses of current evaluation metrics and provides a clear framework for analysis.

Paper Structure

This paper contains 10 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Vision-language based evaluation of generated reports. Our proposed evaluation metric for a reference report (GT) and a candidate report takes into account the associated image, thereby overcoming several drawbacks of common metrics that only measure the distance between two texts. Consequently, when comparing two semantically very similar reports that both suit an image (a), our metric indicates a high degree of similarity, whereas other metrics are unaware of this similarity and give low scores because of textual differences. In contrast, when comparing two reports that are textually similar but differ by a single word that changes the location of a finding (b), common metrics still give a very high score, while our metric penalizes that error.
  • Figure 2: Scores of equivalent normal reports. When comparing two reports with no findings, the score is expected to be high, as both convey the same clinical findings and differ only in how the report is written. In the comparisons with BLEU-4 (a) and RadGraph F1 (b), our metric yields high scores (right side), while the other metrics yield low scores (bottom side). BERTScore (c) gives mid-range scores instead of high scores. CheXpert (d) gives higher scores than the other metrics, yet lower scores than our metric, although it would be expected to yield scores close to $1$, since both reports in each pair contain no findings.
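
The two failure modes of text-only metrics described in the figure captions can be reproduced with a toy example. The following is a minimal sketch, assuming a unigram-overlap F1 as a simple stand-in for text-only metrics; it is not BLEU-4, BERTScore, RadGraph F1, CheXpert, or VLScore. The function name `token_f1` and the four reports are hypothetical and written only for illustration.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two reports (toy stand-in for a text-only metric)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reports (assumptions, not taken from any dataset).
# Case (a): two normal reports with the same meaning but different wording.
paraphrase_ref = "no acute cardiopulmonary abnormality"
paraphrase_cand = "heart and lungs appear normal"

# Case (b): nearly identical wording, but the finding moves to the other side.
laterality_ref = "small pleural effusion in the left lung base"
laterality_cand = "small pleural effusion in the right lung base"

paraphrase_score = token_f1(paraphrase_ref, paraphrase_cand)
laterality_score = token_f1(laterality_ref, laterality_cand)

print(f"equivalent normal reports: {paraphrase_score:.3f}")
print(f"left/right swap:           {laterality_score:.3f}")
```

In this toy setting the clinically equivalent pair shares no tokens and scores 0, while the pair with the laterality error shares 7 of 8 tokens and scores 0.875. This is the opposite of the ordering a clinically meaningful metric should produce, which is the gap an image-aware metric is meant to close.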