Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation

Kang Liu, Zhuoqi Ma, Xiaolu Kang, Yunan Li, Kun Xie, Zhicheng Jiao, Qiguang Miao

TL;DR

The paper tackles chest X-ray report generation by exploiting multi-view longitudinal data and patient-specific priors. It introduces a two-stage framework (MLRG) where Stage 1 performs multi-view longitudinal contrastive learning with learnable view embeddings and cross-modal supervision via losses $L_{MPC}$ and $L_{G}$, and Stage 2 employs tokenized absence encoding to handle missing INDICATION and previous reports, integrating priors through a multi-modal fusion network. The approach achieves state-of-the-art results on MIMIC-CXR, MIMIC-ABN, and Two-view CXR, with notable gains in BLEU-4, F1 RadGraph, and clinical efficacy metrics, demonstrating improved report coherence and clinical accuracy. These advances enable more robust radiology report generation even when prior knowledge is incomplete, potentially enhancing clinical workflow and diagnostic reliability.
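As a rough illustration of the kind of cross-modal contrastive objective used in Stage 1, the sketch below implements a generic symmetric InfoNCE loss between paired image and report embeddings. It is not the paper's $L_{MPC}$ or $L_{G}$, whose exact form is not given in this summary; the function name `symmetric_info_nce`, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(image_feats, text_feats, temperature=0.07):
    """Generic image-to-text plus text-to-image contrastive loss (assumed form).

    image_feats[i] and text_feats[i] are treated as a positive pair;
    every other combination in the batch is treated as a negative.
    """
    img = [_normalize(v) for v in image_feats]
    txt = [_normalize(v) for v in text_feats]
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy batch: two visits, each with one image embedding and one report embedding.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_reports = [[1.0, 0.0], [0.0, 1.0]]
shuffled_reports = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_info_nce(images, aligned_reports)
shuffled_loss = symmetric_info_nce(images, shuffled_reports)
```

In this toy batch the correctly paired embeddings give a much smaller loss than the shuffled ones, which is the behavior a contrastive pre-training objective relies on.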

Abstract

Automated radiology report generation offers an effective solution to alleviate radiologists' workload. However, most existing methods focus primarily on single or fixed-view images to model current disease conditions, which limits diagnostic accuracy and overlooks disease progression. Although some approaches utilize longitudinal data to track disease progression, they still rely on single images to analyze current visits. To address these issues, we propose enhanced contrastive learning with Multi-view Longitudinal data to facilitate chest X-ray Report Generation, named MLRG. Specifically, we introduce a multi-view longitudinal contrastive learning method that integrates spatial information from current multi-view images and temporal information from longitudinal data. This method also utilizes the inherent spatiotemporal information of radiology reports to supervise the pre-training of visual and textual representations. Subsequently, we present a tokenized absence encoding technique to flexibly handle missing patient-specific prior knowledge, allowing the model to produce more accurate radiology reports based on available prior knowledge. Extensive experiments on MIMIC-CXR, MIMIC-ABN, and Two-view CXR datasets demonstrate that our MLRG outperforms recent state-of-the-art methods, achieving a 2.3% BLEU-4 improvement on MIMIC-CXR, a 5.5% F1 score improvement on MIMIC-ABN, and a 2.7% F1 RadGraph improvement on Two-view CXR.
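The idea behind tokenized absence encoding can be sketched as follows: when a piece of patient-specific prior knowledge is missing, a dedicated placeholder token stands in for it, so the model always receives input with the same structure. This is a minimal sketch under stated assumptions, not the authors' implementation; the token strings `[NO_INDICATION]` and `[NO_PRIOR_REPORT]`, the function `build_prior_text`, and the prompt layout are hypothetical.

```python
# Hypothetical placeholder tokens; the paper's actual special tokens may differ.
NO_INDICATION = "[NO_INDICATION]"
NO_PRIOR_REPORT = "[NO_PRIOR_REPORT]"


def _is_present(text):
    """Treat None and whitespace-only strings as missing."""
    return text is not None and text.strip() != ""


def build_prior_text(indication=None, previous_report=None):
    """Build a fixed-structure prior-knowledge string (assumed layout).

    Missing fields are replaced by dedicated tokens rather than dropped,
    so downstream encoding sees the same two slots for every study.
    """
    ind = indication.strip() if _is_present(indication) else NO_INDICATION
    prev = previous_report.strip() if _is_present(previous_report) else NO_PRIOR_REPORT
    return f"INDICATION: {ind} PREVIOUS REPORT: {prev}"


# A first visit with an indication but no earlier report.
first_visit = build_prior_text(indication="Shortness of breath.")
# A study with no prior knowledge at all.
no_priors = build_prior_text()
```

Keeping a fixed slot for each prior lets a single model cover first visits, follow-up visits, and studies without an "INDICATION" field, instead of requiring a separate input format for each case.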

Paper Structure

This paper contains 18 sections, 10 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: (A) shows medical historical data of a subject (patient) over time. (B) compares inputs for RRG, with AP and PA as frontal views, and Lat and Rep as a lateral view and its report. Ind and MVL Data are "INDICATION" and multi-view longitudinal data.
  • Figure 2: Overview of our proposed MLRG, including a vision encoder (RAD-DINO), a text encoder (CXR-BERT), and a text generator (DistilGPT2). MLRG first learns visual features through multi-view longitudinal contrastive learning and then generates radiology reports based on patient-specific prior knowledge.
  • Figure 3: (A) represents the multi-view longitudinal fusion (MLF) network. (B) denotes the multi-modal fusion network.
  • Figure 4: Examples of generated reports on the MIMIC-CXR test set. Each "A/B" cell refers to "MLRG/SEI". Sentences in the reference report are highlighted in unique colors to clarify alignment with descriptions in the generated reports. Matching content in generated reports is shown in the same color, while correct temporal descriptions and failure descriptions of our MLRG are shown in bold and underlined, respectively.
  • Figure 5: Comparison with baselines on MIMIC-CXR using LLMs. "#Matched Findings" denotes the number of matched findings between generated and reference reports.
  • ...and 3 more figures