Improving Medical Visual Representations via Radiology Report Generation

Keegan Quigley, Miriam Cha, Josh Barua, Geeticka Chauhan, Seth Berkowitz, Steven Horng, Polina Golland

TL;DR

This work introduces RadTex, a radiology-focused encoder-decoder model trained with bidirectional captioning to learn fine-grained visual-text representations. It demonstrates that generative captioning pretraining can match or exceed contrastive MVLP in downstream tasks while enabling rapid radiology report generation and interactive prompting. Through comprehensive ablations, the authors show that longer context, domain-specific vocabulary, and the removal of priors improve performance and reduce hallucinations, and that MS-COCO pretraining provides a better initialization. RadTex also offers data-efficient transfer learning and interpretable outputs, suggesting practical utility in radiology workflows and potential extension to other domains requiring localized visual-semantic understanding.

Abstract

Vision-language pretraining has been shown to produce high-quality visual encoders that transfer efficiently to downstream computer vision tasks. Contrastive learning approaches have increasingly been adopted for medical vision-language pretraining (MVLP), yet recent developments in generative AI offer new modeling alternatives. This paper introduces RadTex, a CNN-encoder transformer-decoder architecture optimized for radiology. We explore bidirectional captioning as an alternative MVLP strategy and demonstrate that RadTex's captioning pretraining is competitive with established contrastive methods, achieving a CheXpert macro-AUC of 89.4%. Additionally, RadTex's lightweight text decoder not only generates clinically relevant radiology reports (macro-F1 score of 0.349), but also provides targeted, interactive responses, highlighting the utility of bidirectional captioning in advancing medical image analysis.
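
The abstract reports a macro-AUC, which is conventionally the unweighted mean of per-class AUCs in a multi-label setting. The following is a minimal sketch of that averaging, not the authors' evaluation code: the function names `binary_auc` and `macro_auc` are hypothetical, and the labels and scores are made-up toy values rather than data from the paper.

```python
from typing import Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix: Sequence[Sequence[int]],
              score_matrix: Sequence[Sequence[float]]) -> float:
    """Unweighted mean of per-class AUCs; each row is a study, each column a class."""
    num_classes = len(label_matrix[0])
    per_class = [
        binary_auc([row[c] for row in label_matrix],
                   [row[c] for row in score_matrix])
        for c in range(num_classes)
    ]
    return sum(per_class) / num_classes


# Toy, made-up labels and scores for two hypothetical classes (not data from the paper).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.7], [0.4, 0.5]]
toy_macro_auc = macro_auc(toy_labels, toy_scores)
```

Because every class contributes equally to the mean, rare pathologies weigh as much as common ones, which is one reason macro averaging is a common choice for imbalanced multi-label tasks.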

Paper Structure

This paper contains 32 sections, 1 equation, 4 figures, 7 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of RadTex architecture and outputs, including pretraining, classification, and report generation in the Prompted setting. We show that bidirectional captioning is an effective method of medical vision-language pretraining (left), exceeding contrastive learning performance on downstream visual tasks while also enhancing interpretability. After generative pretraining, the RadTex visual encoder is frozen and a linear head is trained to classify pathologies (center; \ref{sec:visual_downstream}). Furthermore, the entire pretrained model can be used directly for radiology report generation by sampling tokens from the output (right; \ref{sec:rrg}).
  • Figure 2: Bar plot showing linear classification results. RadTex is competitive with CheXzero and other methods across multiple downstream classification tasks. RadTex results are for RadTex/C+M pretraining. Each model's visual backbone is frozen and a linear layer is trained on top; we display mean results over three random trials. See \ref{app:vision_encoder_results} for more details, including standard errors.
  • Figure 3: Confusion matrices showing multi-level grading performance for RadTex (bottom) and CheXzero (top) on the EdemaSeverity task. Values represent the mean proportion over three random trials.
  • Figure 4: Report-level pathology precision on the MIMIC-CXR test set vs. frequency in training data, as a measure of pathology hallucination. $\mathcal{M}^2$ Trans and R2Gen numbers are from Miura et al. (2021). Linear regressions and 95% confidence intervals for each RRG method are shown. Pearson correlation coefficients are $r = 0.89$, $0.78$, and $0.62$ for RadTex, R2Gen, and $\mathcal{M}^2$ Trans, respectively. The CheXpert competition pathologies are labeled for RadTex.
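
The Figure 4 analysis relates each pathology's training-set frequency to its report-level precision through a Pearson correlation and a linear regression. A minimal sketch of those two computations follows. It is an illustration under stated assumptions, not the authors' analysis code: the helper names `pearson_r` and `least_squares_line` are hypothetical, the frequency and precision values are made up, and the paper may transform the frequency axis differently (for example, with a log scale).

```python
import math
from typing import Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative values (not results from the paper):
# fraction of training reports mentioning each pathology, and report-level precision.
toy_frequency = [0.05, 0.10, 0.20, 0.40]
toy_precision = [0.10, 0.25, 0.30, 0.55]

toy_r = pearson_r(toy_frequency, toy_precision)
toy_slope, toy_intercept = least_squares_line(toy_frequency, toy_precision)
```

A value of $r$ close to 1 means precision tracks training frequency closely, so rarely seen pathologies are the ones most often mentioned incorrectly.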