Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation

Xiao Wang, Fuling Wang, Haowen Wang, Bo Jiang, Chuanfu Li, Yaowei Wang, Yonghong Tian, Jin Tang

TL;DR

This work tackles a gap in radiology report generation: LLM-based systems produce fluent text but miss key disease information. It introduces AM-MRG, a two-stage framework. The first stage mines disease-specific visual tokens from X-ray images using a Swin Transformer, a Q-Former, GradCAM ROIs, and a disease query mechanism. The second stage augments these features with two Modern Hopfield networks, one operating on a disease-visual memory and one on a report memory. The LLM-based generator takes the enhanced features and a generation prompt and produces the report. Training is staged: a multi-label classification objective first, then an autoregressive generation objective. On IU X-ray, MIMIC-CXR, and Chexpert Plus, AM-MRG reports state-of-the-art results on NLG metrics and clinical efficacy (CE) metrics, and ablations support the contribution of each component. The authors point to integrating memory-augmented vision-language models with medical knowledge graphs as future work.

Abstract

X-ray image-based medical report generation has made significant progress in recent years with the help of large language models. However, these models have not fully exploited the information in visual image regions, resulting in reports that are linguistically sound but insufficient in describing key diseases. In this paper, we propose a novel associative memory-enhanced X-ray report generation model that mimics the process professional doctors follow when writing medical reports. It mines both global and local visual information and associates historical report information to better complete the current report. Specifically, given an X-ray image, we first use a classification model along with its activation maps to mine the visual regions most associated with diseases and to learn disease query tokens. Then, we employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information. This process facilitates the generation of high-quality reports based on a large language model and achieves state-of-the-art performance on multiple benchmark datasets, including IU X-ray, MIMIC-CXR, and Chexpert Plus. The source code of this work is released at https://github.com/Event-AHU/Medical_Image_Analysis.
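
The memory association step can be illustrated with the standard Modern Hopfield retrieval rule, in which a query is replaced by a softmax-weighted combination of stored patterns. The sketch below is a minimal, pure-Python illustration of that rule only. It is not the authors' implementation: the function names, the toy memory bank, and the value of `beta` are assumptions made for this example, and the learned projections that a trained Hopfield layer would apply are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(query, memory, beta=1.0):
    """One Modern Hopfield update step (hypothetical helper).

    query  -- list of d floats, e.g. a disease-related visual token
    memory -- list of N stored patterns, each a list of d floats
    beta   -- inverse temperature; larger values give sharper retrieval
    """
    # Similarity of the query to every stored pattern, scaled by beta.
    scores = [beta * sum(q * m for q, m in zip(query, row)) for row in memory]
    weights = softmax(scores)
    # Retrieved pattern: weighted sum of the stored patterns.
    dim = len(query)
    retrieved = [
        sum(w * row[j] for w, row in zip(weights, memory)) for j in range(dim)
    ]
    return retrieved, weights

# Toy memory bank with three stored patterns (illustrative values only).
memory_bank = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
noisy_query = [0.9, 0.1, 0.0]
retrieved, weights = hopfield_retrieve(noisy_query, memory_bank, beta=8.0)
```

With a large `beta`, the weights concentrate on the stored pattern closest to the query, so the noisy query is pulled toward the first memory entry. A smaller `beta` would instead return a blend of several stored patterns.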

Paper Structure

This paper contains 20 sections, 12 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: A comparison between the process of professional doctors analyzing X-ray images and writing medical reports, and the proposed associative memory-enhanced X-ray LLM report generation framework.
  • Figure 2: An overview of our proposed Associative Memory augmented LLM for X-ray Medical Report Generation, termed AM-MRG. The first stage mainly mines the disease-aware visual tokens based on activation maps. The second stage augments the large language model-based X-ray medical reporter using associative memory. P.E. and F.E. are short for Position Encoding and Feature Embedding, respectively.
  • Figure 3: Illustration of the computation procedure of the Modern Hopfield Network (MHN).
  • Figure 4: The GradCAM activation maps and the mined visual tokens in stage 1.
  • Figure 5: An illustration of the disease-aware findings of the baseline and our newly proposed model.
  • ...and 1 more figure