CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging
Ibrahim Ethem Hamamci, Sezgin Er, Bjoern Menze
TL;DR
CT2Rep introduces the first framework for automated radiology report generation from 3D chest CT volumes, addressing the lack of 3D-capable methods and data scarcity. It maps a volume $x \in \mathbb{R}^{240\times480\times480}$ to a report, represented as a token sequence over a vocabulary $\mathbb{V}$, using three components: a novel auto-regressive causal transformer that serves as the 3D vision feature extractor, a transformer encoder, and a transformer decoder with relational memory and memory-driven conditional layer normalization. The authors further extend the model with CT2RepLong, a longitudinal multimodal fusion system that applies cross-attention between prior volumes $x^{old}$ and prior reports $r^{old}$ to improve the description of the current volume. Evaluations on the CT-RATE dataset show that CT2Rep outperforms a 3D vision-encoder baseline (CT-Net) on NLG and clinical-efficacy metrics, while CT2RepLong provides additional gains by leveraging historical information. The work is open-sourced to accelerate 3D radiology reporting research.
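To make the "memory-driven conditional layer normalization" in the decoder more concrete, here is a minimal sketch of the general idea: the scale and shift of a layer norm are adjusted by offsets predicted from a memory vector. This is an illustration under stated assumptions, not the authors' implementation. The function names, the toy dimensions, and the use of a single linear map (rather than a learned MLP) to predict the offsets are all hypothetical.

```python
import math

def layer_normalize(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(vec, weight):
    """Apply a weight matrix given as a list of rows (one row per output)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def memory_conditioned_layer_norm(x, memory, gamma, beta, w_gamma, w_beta):
    """Layer norm whose scale and shift are offset by the memory vector.

    Assumption: the offsets come from plain linear maps of `memory`;
    a real model would typically learn a small MLP for each offset.
    """
    normed = layer_normalize(x)
    delta_gamma = linear(memory, w_gamma)  # memory-dependent scale offset
    delta_beta = linear(memory, w_beta)    # memory-dependent shift offset
    return [
        (g + dg) * n + (b + db)
        for n, g, dg, b, db in zip(normed, gamma, delta_gamma, beta, delta_beta)
    ]

# Toy usage: 4 features, a 2-dimensional memory vector (hypothetical sizes).
x = [1.0, 2.0, 3.0, 4.0]
memory = [0.5, -0.5]
gamma, beta = [1.0] * 4, [0.0] * 4
zero_w = [[0.0, 0.0] for _ in range(4)]   # memory has no effect
shift_w = [[2.0, 0.0] for _ in range(4)]  # memory adds 2 * 0.5 to every shift
plain = memory_conditioned_layer_norm(x, memory, gamma, beta, zero_w, zero_w)
shifted = memory_conditioned_layer_norm(x, memory, gamma, beta, zero_w, shift_w)
```

With zero offset weights the function reduces to an ordinary layer norm, which makes the role of the memory easy to isolate: any difference between `plain` and `shifted` comes only from the memory-dependent terms.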
Abstract
Medical imaging plays a crucial role in diagnosis, with radiology reports serving as vital documentation. Automating report generation has emerged as a critical need to alleviate the workload of radiologists. While machine learning has facilitated report generation for 2D medical imaging, extending this to 3D has remained unexplored due to computational complexity and data scarcity. We introduce the first method to generate radiology reports for 3D medical imaging, specifically targeting chest CT volumes. Given the absence of comparable methods, we establish a baseline using an advanced 3D vision encoder in medical imaging to demonstrate the effectiveness of our method, which leverages a novel auto-regressive causal transformer. Furthermore, recognizing the benefits of leveraging information from previous visits, we augment CT2Rep with a cross-attention-based multimodal fusion module and hierarchical memory, enabling the incorporation of longitudinal multimodal data. Access our code at https://github.com/ibrahimethemhamamci/CT2Rep
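The cross-attention-based fusion mentioned above can be illustrated with a single-head scaled dot-product sketch in which tokens from one modality act as queries and tokens from the other act as keys and values. This is a sketch under assumptions, not the CT2Rep code: the choice of prior-volume features as queries and prior-report features as keys and values, the toy feature sizes, and the absence of learned projection matrices are all hypothetical simplifications.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention across two token sets.

    queries: tokens from one modality (here: prior-volume features).
    keys, values: tokens from the other modality (here: prior-report features).
    Returns one fused vector per query token.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Toy usage with hypothetical 2-dimensional features.
volume_tokens = [[1.0, 1.0], [0.0, 2.0], [3.0, 0.0]]  # 3 prior-volume tokens
report_keys = [[1.0, 0.0], [1.0, 0.0]]                # 2 prior-report tokens
report_values = [[0.0, 2.0], [4.0, 6.0]]
fused = cross_attention(volume_tokens, report_keys, report_values)
```

Each fused vector is a convex combination of the report-side values, so the output keeps one vector per volume token while mixing in information from the report. In this toy input the two keys are identical, so every query attends to both report tokens equally.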
