MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks

Elad Hirsch, Gefen Dawidowicz, Ayellet Tal

TL;DR

MedRAT tackles unpaired medical report generation by training a report auto-encoder on text while learning a shared multimodal space for images and reports via two auxiliary tasks. A memory-augmented shared encoder–decoder with global and local representations aligns modalities without paired data, enabling image-to-report generation at inference. The two auxiliary tasks—multi-label contrastive learning and multi-label classification—drive cross-modal alignment and robust representations, improving both language metrics and clinical efficacy. Results on CheXpert/MIMIC-CXR and IU X-ray show state-of-the-art performance among unpaired methods, narrowing the gap to fully paired approaches while preserving privacy.

Abstract

Medical report generation from X-ray images is a challenging task, particularly in an unpaired setting where paired image-report data is unavailable for training. To address this challenge, we propose a novel model that leverages the available information in two distinct datasets, one comprising reports and the other consisting of images. The core idea of our model revolves around the notion that combining auto-encoding report generation with multi-modal (report-image) alignment can offer a solution. However, the challenge persists regarding how to achieve this alignment when pair correspondence is absent. Our proposed solution involves the use of auxiliary tasks, particularly contrastive learning and classification, to position related images and reports in close proximity to each other. This approach differs from previous methods that rely on pre-processing steps, such as using external information stored in a knowledge graph. Our model, named MedRAT, surpasses previous state-of-the-art methods, demonstrating the feasibility of generating comprehensive medical reports without the need for paired data or external tools.
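As a rough illustration of how a contrastive auxiliary task can pull related images and reports together without pair correspondence, the sketch below computes an InfoNCE-style loss in which a report counts as a positive for an image whenever the two share at least one label. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `multilabel_contrastive_loss`, the multi-hot label vectors, the "shares at least one label" rule, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-12)  # guard against zero-length vectors


def multilabel_contrastive_loss(img_embs, img_labels, rep_embs, rep_labels,
                                temperature=0.1):
    """Image-to-report contrastive loss driven by labels, not by pairing.

    Assumption (hypothetical): a report is a positive for an image if their
    multi-hot label vectors share at least one active label.
    Images with no positive report in the batch are skipped.
    """
    losses = []
    for z_img, y_img in zip(img_embs, img_labels):
        logits = [cosine(z_img, z_rep) / temperature for z_rep in rep_embs]
        positives = [any(a and b for a, b in zip(y_img, y_rep))
                     for y_rep in rep_labels]
        if not any(positives):
            continue
        # Numerically stable softmax over all reports in the batch.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        pos_mass = sum(e for e, p in zip(exps, positives) if p)
        losses.append(-math.log(pos_mass / sum(exps)))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two images and two reports with two hypothetical labels.
labels = [[1, 0], [0, 1]]
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]
misaligned_reports = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = multilabel_contrastive_loss(images, labels, aligned_reports, labels)
loss_misaligned = multilabel_contrastive_loss(images, labels, misaligned_reports, labels)
```

In this toy batch the loss is small when reports with matching labels sit near the corresponding images and large when they sit near images with different labels, which is the pressure that moves related images and reports close together in the shared space.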

Paper Structure

This paper contains 10 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Unpaired medical report generation. (a) We propose a model that addresses the challenge of unpaired images and reports by learning to generate reports from reports, and embedding related images and reports close together in the embedding space. Notably, our model achieves this without access to image-report pairs during training. (b) By learning both tasks simultaneously, our model is able to generate detailed reports for X-ray images during inference.
  • Figure 2: Method. (a) During the feature extraction stage, reports and images pass through separate streams. Report words and image patches are encoded and combined with memory vectors queried from the learned shared memory. (b) The textual and visual embeddings are fed separately into a shared encoder, producing local representations in a shared space. These representations are then aggregated into a global representation using self-attention (SA). (c) Multi-modal alignment is performed through two auxiliary tasks, classification and contrastive learning, which pull related global representations closer together and push unrelated ones apart. (d) Simultaneously, local report representations are augmented during training, and the text decoder receives both the global and local representations as input to produce the final report. During training the decoder uses the report representations, while at inference it uses the image representations. In this figure, the solid green and orange lines indicate the training phase, while the dashed lines represent the inference phase.
  • Figure 3: Qualitative evaluation. Our model-generated report (c) contains similar information to the ground-truth report (b). It describes the location of the endotracheal tube tip above the carina, the presence of edema, low lung volumes, and irregularities in the pulmonary vasculature, while ruling out pneumothorax and suggesting only a possible small pleural effusion. The report contains much more information than just the presence of edema (the information used for training) and uses phrases similar to those in the ground-truth report to describe the findings.
  • Figure 4: Attention visualization. (a) presents the input and (b-d) show examples of attention maps generated by our model, where bright values represent high attention. These maps demonstrate where the model is focusing when predicting specific words. For example, when predicting "heart" (b), the model's attention is on the central area around the heart; for "tip" (c), it concentrates around the trachea above the carina; and for "pleural" (d), it focuses on the left pleural cavity area (right side of the image). Notably, this is achieved without training on patch-word alignment.
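
The aggregation step described in Figure 2(b), where local representations are combined into one global representation, can be illustrated with a simple attention pooling. This is a simplified stand-in for the self-attention aggregation named in the caption, not the paper's actual layer: the function name `attention_pool` and the single `query` vector (which would be learned in a real model) are assumptions made for illustration.

```python
import math


def attention_pool(local_reps, query):
    """Pool local vectors into one global vector with softmax attention.

    local_reps: list of d-dimensional local representations (e.g. one per
                image patch or report word).
    query:      d-dimensional vector (hypothetical; learned in a real model).
    Returns the pooled vector and the attention weights.
    """
    d = len(query)
    # Scaled dot-product score of each local representation against the query.
    scores = [sum(q * h for q, h in zip(query, rep)) / math.sqrt(d)
              for rep in local_reps]
    # Numerically stable softmax over the local positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the local representations.
    pooled = [sum(w * rep[j] for w, rep in zip(weights, local_reps))
              for j in range(d)]
    return pooled, weights


local = [[1.0, 0.0], [0.0, 1.0]]
uniform_global, uniform_weights = attention_pool(local, [0.0, 0.0])
focused_global, focused_weights = attention_pool(local, [10.0, 0.0])
```

With a zero query every position receives equal weight, so the pooled vector is the mean of the local representations; a query aligned with the first local vector shifts almost all of the weight onto it. Because the same pooling is applied to image and report streams, both modalities end up with a global vector of the same shape that the auxiliary tasks can compare.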