
R2GenCSR: Retrieving Context Samples for Large Language Model based X-ray Medical Report Generation

Xiao Wang, Yuehang Li, Fuling Wang, Shiao Wang, Chuanfu Li, Bo Jiang

TL;DR

R2GenCSR tackles efficient X-ray medical report generation by marrying a linear-complexity vision backbone (Mamba/VMamba) with training-time retrieval of context samples (positive/negative) to guide an LLM through structured prompts. Context residuals derived from retrieved samples, combined with tokenized visual/text cues, are fed to an instruction-tuned LLM, optimized via a cross-entropy objective. The method achieves competitive or superior performance on IU-Xray, MIMIC-CXR, and CheXpert Plus across standard metrics, while reducing computational burden relative to Transformer-based backbones. This approach demonstrates the practical value of context-aware, retrieval-augmented generation for radiology reports and suggests avenues for incorporating domain knowledge and more advanced retrieval techniques in future work.
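
For readers who want the cross-entropy objective mentioned above in symbols, one common way to write a token-level autoregressive loss is sketched below. The notation is an illustrative assumption and may differ from the paper's own equations: $y_t$ is the $t$-th report token, $T$ the report length, $V$ the vision tokens, $R$ the context residual tokens, $P$ the prompt tokens, and $\theta$ the trainable parameters.

```latex
\mathcal{L}_{\mathrm{CE}}
  = -\sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid y_{<t},\, V,\, R,\, P\right)
```

Minimizing this quantity raises the probability the model assigns to each ground-truth report token given the preceding tokens and the conditioning inputs.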

Abstract

Inspired by the tremendous success of Large Language Models (LLMs), existing X-ray medical report generation methods attempt to leverage large models to achieve better performance. They usually adopt a Transformer to extract the visual features of a given X-ray image and then feed them into the LLM for text generation. How to extract more effective information to help the LLM improve its final results remains an urgent open problem. Additionally, visual Transformer models bring high computational complexity. To address these issues, this paper proposes a novel context-guided, efficient X-ray medical report generation framework. Specifically, we introduce Mamba as the vision backbone, which has linear complexity and achieves performance comparable to that of a strong Transformer model. More importantly, during the training phase we perform context retrieval from the training set for the samples within each mini-batch, utilizing both positively and negatively related samples to enhance feature representation and discriminative learning. Subsequently, we feed the vision tokens, context information, and prompt statements to the LLM to generate high-quality medical reports. Extensive experiments on three X-ray report generation datasets (i.e., IU-Xray, MIMIC-CXR, CheXpert Plus) validate the effectiveness of our proposed model. The source code of this work will be released at https://github.com/Event-AHU/Medical_Image_Analysis.
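
The retrieval-and-residual idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the selection rule (cosine similarity over a feature bank), the function names `cosine`, `retrieve_context`, and `residual`, and the toy feature values are all assumptions made for illustration. The abstract does not state how positive and negative samples are chosen.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_context(query, bank, k=1):
    """Return indices of the k most similar and k least similar bank entries.

    Assumption: similarity ranking stands in for the paper's notion of
    positively / negatively related samples.
    """
    order = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    return order[:k], order[-k:]


def residual(query, context):
    """Element-wise difference between the query feature and one context feature."""
    return [q - c for q, c in zip(query, context)]


# Toy features standing in for pooled visual tokens (made-up numbers).
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]

pos_idx, neg_idx = retrieve_context(query, bank, k=1)
pos_res = [residual(query, bank[i]) for i in pos_idx]
neg_res = [residual(query, bank[i]) for i in neg_idx]
```

In a full pipeline, residuals like `pos_res` and `neg_res` would be passed to the LLM alongside the vision tokens and the prompt; here they are plain Python lists so the arithmetic is easy to inspect.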

Paper Structure (22 sections, 7 equations, 5 figures, 10 tables)

This paper contains 22 sections, 7 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Comparison between (a-c) existing X-ray report generation frameworks and (d) our newly proposed one; (e) t-SNE feature distribution of our sampled positive and negative context samples from the IU-Xray dataset.
  • Figure 2: An overview of our proposed context-sample-augmented large language model for efficient X-ray medical report generation, termed R2GenCSR. The framework has three main modules: the Mamba vision backbone, context sample retrieval, and the large language model (LLM). We first extract the visual tokens of the input X-ray image using the Mamba backbone, then retrieve context samples from the training subset. We obtain the residual tokens by taking the difference between the tokens of the input image and those of its context samples. The LLM takes the vision tokens, context residual tokens, and prompt statements as input and generates a high-quality medical report.
  • Figure 3: Ablation of language model on IU-Xray dataset.
  • Figure 4: An X-ray image, its feature map, and its corresponding report on the MIMIC-CXR dataset.
  • Figure 5: An X-ray image and its corresponding ground-truth report, along with the report generated by our model, on the MIMIC-CXR dataset. Mismatched sentences in the reports are highlighted in different colors.