Act Like a Radiologist: Radiology Report Generation across Anatomical Regions
Qi Chen, Yutong Xie, Biao Wu, Xiaomin Chen, James Ang, Minh-Son To, Xiaojun Chang, Qi Wu
TL;DR
This work tackles radiology report generation across multiple anatomical regions, addressing chest-centric data limitations and semantic drift in cross-dataset deployment. It introduces X-RGen, a radiologist-minded framework that proceeds through four phases (initial observation, cross-region analysis, medical interpretation, and report formation), coupled with a general radiological knowledge base and region-aware knowledge selection. A cross-region learning objective aligns image and report representations across body parts, while a Transformer-based knowledge aggregation module and decoder generate medically informed reports; training combines a captioning loss $\mathcal{L}_{cap}$ with a weighted cross-region loss $\lambda \mathcal{L}_{x}$. Evaluations on a merged six-region X-ray dataset (including IU-Xray for chest) show X-RGen outperforms specialised and generalist baselines on NLG metrics (BLEU, CIDEr, METEOR) and clinical measures (recall, F1), with evidence from qualitative examples, CLIPScore-based semantic alignment, and CheXpert-based recognition probes. The results underscore the impact of cross-region learning and medical-knowledge integration for robust, clinically relevant radiology report generation with broader applicability beyond chest imaging.
Abstract
Automating radiology report generation can ease the reporting workload for radiologists. However, existing works focus mainly on the chest area due to the limited availability of public datasets for other regions. Moreover, they often rely on naive data-driven approaches, e.g., a basic encoder-decoder framework with captioning loss, which limits their ability to recognise complex patterns across diverse anatomical regions. To address these issues, we propose X-RGen, a radiologist-minded report generation framework across six anatomical regions. In X-RGen, we seek to mimic the behaviour of human radiologists, breaking it down into four principal phases: 1) initial observation, 2) cross-region analysis, 3) medical interpretation, and 4) report formation. Firstly, we adopt an image encoder for feature extraction, akin to a radiologist's preliminary review. Secondly, we enhance the recognition capacity of the image encoder by analysing images and reports across various regions, mimicking how radiologists gain their experience and improve their professional ability from past cases. Thirdly, just as radiologists apply their expertise to interpret radiology images, we introduce radiological knowledge of multiple anatomical regions to further analyse the features from a clinical perspective. Lastly, we generate reports based on the medical-aware features using a typical auto-regressive text decoder. Both natural language generation (NLG) and clinical efficacy metrics show the effectiveness of X-RGen on six X-ray datasets. Our code and checkpoints are available at: https://github.com/YtongXie/X-RGen.
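
As a rough illustration of how a captioning loss and a weighted cross-region loss might be combined during training, the sketch below assumes that the cross-region term takes a symmetric contrastive form over an image-report similarity matrix, where entry (i, j) scores image i against report j and matched pairs lie on the diagonal. This form, the function names, the temperature, and the value of lam are all assumptions made for illustration; they are not taken from the X-RGen code and the actual definition of the cross-region loss may differ.

```python
import math


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def cross_region_loss(sim, temperature=0.07):
    """Assumed symmetric contrastive loss (hypothetical stand-in for L_x).

    sim[i][j] is the similarity between image i and report j; samples in
    the batch may come from different anatomical regions.
    """
    scaled = [[v / temperature for v in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    image_to_report = diagonal_cross_entropy(scaled)
    report_to_image = diagonal_cross_entropy(transposed)
    return 0.5 * (image_to_report + report_to_image)


def total_loss(l_cap, l_x, lam=0.1):
    """Combined objective L_cap + lam * L_x; lam=0.1 is a placeholder value."""
    return l_cap + lam * l_x


# Example: two image-report pairs whose matched similarities dominate.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
l_x = cross_region_loss(sim)
loss = total_loss(l_cap=2.3, l_x=l_x, lam=0.1)
```

Under this assumed form, a similarity matrix with strong diagonal entries yields a small cross-region term, so the weight lam controls how much the alignment signal contributes relative to the captioning loss.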
