MedRAT: Unpaired Medical Report Generation via Auxiliary Tasks
Elad Hirsch, Gefen Dawidowicz, Ayellet Tal
TL;DR
MedRAT tackles unpaired medical report generation by training a report auto-encoder on text while learning a shared multimodal space for images and reports via two auxiliary tasks. A memory-augmented shared encoder–decoder with global and local representations aligns modalities without paired data, enabling image-to-report generation at inference. The two auxiliary tasks—multi-label contrastive learning and multi-label classification—drive cross-modal alignment and robust representations, improving both language metrics and clinical efficacy. Results on CheXpert/MIMIC-CXR and IU X-ray show state-of-the-art performance among unpaired methods, narrowing the gap to fully paired approaches while preserving privacy.
Abstract
Medical report generation from X-ray images is a challenging task, particularly in the unpaired setting, where no paired image-report data is available for training. To address this challenge, we propose a novel model that leverages the information available in two distinct datasets, one comprising reports and the other consisting of images. The core idea is that combining auto-encoding report generation with multi-modal (report-image) alignment can offer a solution. The remaining challenge is how to achieve this alignment when pair correspondence is absent. Our solution uses auxiliary tasks, specifically contrastive learning and classification, to position related images and reports close to each other. This approach differs from previous methods that rely on pre-processing steps, such as using external information stored in a knowledge graph. Our model, named MedRAT, surpasses previous state-of-the-art methods, demonstrating the feasibility of generating comprehensive medical reports without the need for paired data or external tools.
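To make the alignment idea concrete, the following is a minimal sketch of how a multi-label contrastive objective could pull related images and reports together without pair correspondence. It is an illustration under stated assumptions, not the authors' implementation: the InfoNCE-style form of the loss, the rule that an image and a report count as a positive pair when they share at least one label, the temperature value, and the names `cosine` and `multilabel_contrastive_loss` are all hypothetical choices made here. Embeddings are assumed to be non-zero vectors, and labels are assumed to be available as sets of finding names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multilabel_contrastive_loss(img_embs, img_labels, rep_embs, rep_labels,
                                temperature=0.1):
    """InfoNCE-style loss over unpaired images and reports.

    Assumption (not from the source): a report is a positive for an image
    when their label sets intersect; all other reports act as negatives.
    """
    losses = []
    for z_img, l_img in zip(img_embs, img_labels):
        # Temperature-scaled similarity of this image to every report.
        sims = [cosine(z_img, z_rep) / temperature for z_rep in rep_embs]
        # Log-sum-exp over all reports, computed stably.
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        # Keep only reports that share at least one label with the image.
        positives = [s for s, l_rep in zip(sims, rep_labels) if l_img & l_rep]
        if not positives:
            continue  # no positive in this batch, skip the image
        # Average negative log-likelihood over the positives.
        losses.append(-sum(s - log_denom for s in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two images and two reports with one label each.
img_embs = [[1.0, 0.0], [0.0, 1.0]]
img_labels = [{"edema"}, {"pneumonia"}]
rep_labels = [{"edema"}, {"pneumonia"}]

aligned = multilabel_contrastive_loss(
    img_embs, img_labels, [[1.0, 0.0], [0.0, 1.0]], rep_labels)
misaligned = multilabel_contrastive_loss(
    img_embs, img_labels, [[0.0, 1.0], [1.0, 0.0]], rep_labels)
```

In this toy batch, `aligned` is smaller than `misaligned`, since the loss is low when embeddings of images and reports that share a label point in the same direction. In a real system the embeddings would come from the learned encoders and the loss would be minimized with a differentiable framework; plain Python is used here only to keep the sketch self-contained.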
