Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation
Kang Liu, Zhuoqi Ma, Xiaolu Kang, Zhusi Zhong, Zhicheng Jiao, Grayson Baird, Harrison Bai, Qiguang Miao
TL;DR
This work tackles chest X-ray report generation (CXRG) by explicitly incorporating patient-specific indications and strengthening cross-modal alignment between images and textual findings. It introduces Structural Entities Extraction and patient Indications Incorporation (SEI), comprising Structural Entities Extraction (SEE) to derive factual entity sequences and a cross-modal fusion network to integrate X-ray imagery, similar historical cases, and patient indications. The approach first pre-trains a cross-modal alignment module using factual sequences, then performs gradient-free retrieval of similar historical cases and fuses them with indications to generate reports, optimized via a negative log-likelihood objective $L_{LM}$. On the MIMIC-CXR dataset, SEI achieves state-of-the-art results across NLG and clinical-efficacy metrics, with ablations confirming the individual and combined value of SEE, similar historical cases, and indications for clinical fidelity and linguistic fluency.
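For reference, a negative log-likelihood objective of this kind is usually written in the standard autoregressive form below. This is a sketch under assumed notation, not a formula taken from the paper: $y_t$ is the $t$-th report token, $T$ the report length, $I$ the X-ray image, $H$ the retrieved similar historical cases, $C$ the patient indications, and $\theta$ the model parameters.

```latex
% Standard token-level negative log-likelihood (notation assumed, not from the source)
L_{LM} = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, I,\, H,\, C\right)
```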
Abstract
The automated generation of imaging reports proves invaluable in alleviating the workload of radiologists. A clinically applicable report generation algorithm should produce reports that accurately describe radiology findings and attend to patient-specific indications. In this paper, we introduce a novel method, \textbf{S}tructural \textbf{E}ntities extraction and patient indications \textbf{I}ncorporation (SEI) for chest X-ray report generation. Specifically, we employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports and improve the quality of factual entity sequences. Aligning X-ray images with these factual entity sequences reduces noise in the subsequent cross-modal alignment module, thereby enhancing the precision of cross-modal alignment and further aiding the model in gradient-free retrieval of similar historical cases. Subsequently, we propose a cross-modal fusion network to integrate information from X-ray images, similar historical cases, and patient-specific indications. This process allows the text decoder to attend to discriminative features of X-ray images, assimilate historical diagnostic information from similar cases, and understand the examination intention of patients. This, in turn, helps the text decoder produce high-quality reports. Experiments conducted on MIMIC-CXR validate the superiority of SEI over state-of-the-art approaches on both natural language generation and clinical efficacy metrics.
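To make the idea of gradient-free retrieval concrete, the following minimal sketch ranks a bank of precomputed embeddings by cosine similarity to a query embedding and returns the top matches. It is an illustration only: the abstract does not specify the similarity measure or the number of retrieved cases, so cosine similarity, the function names `l2_normalize` and `retrieve_similar_cases`, the parameter `k`, and the toy vectors are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_similar_cases(
    query: Sequence[float],
    bank: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (index, cosine similarity) for the k bank entries closest to query.

    No gradients are involved: the embeddings are treated as fixed numbers
    (assumed here to come from an already pre-trained alignment module).
    """
    q = l2_normalize(query)
    scored = []
    for idx, emb in enumerate(bank):
        e = l2_normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for historical-case features (hypothetical values).
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = retrieve_similar_cases([2.0, 0.1], bank, k=2)
```

In this toy example the query points mostly along the first axis, so the first and third bank entries rank highest. In a real pipeline the reports attached to the retrieved indices would be passed on to the fusion stage.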
