3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models

Hao Chen, Wei Zhao, Yingli Li, Tianyang Zhong, Yisong Wang, Youlan Shang, Lei Guo, Junwei Han, Tianming Liu, Jun Liu, Tuo Zhang

TL;DR

This work tackles the challenge of generating reliable radiology reports from 3D chest CT scans by introducing 3D-CT-GPT, a Visual Question Answering–based medical visual language model that fuses a CT-ViT encoder, 3D average pooling, and a projection layer to align visual features with a language model. The model is trained in a data-efficient two-stage process (pre-training on public data followed by fine-tuning on private data) and evaluated against published and private baselines, showing superior report accuracy and coherence. Across both direct and indirect comparisons, 3D-CT-GPT achieves higher BLEU, ROUGE, METEOR, and BERTScore metrics, with ablations confirming the value of fine-tuning and an effective projection strategy. The results imply strong potential for clinical deployment, particularly in resource-constrained settings, and lay groundwork for broader 3D radiology report generation and scalable VQA in medical imaging.
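The pooling and projection steps named above can be illustrated with a small NumPy sketch. It is an assumption-laden illustration, not the authors' implementation: the token grid shape, the pooling window, the embedding sizes, and the helper names `avg_pool_3d` and `project_to_llm` are all hypothetical, and random numbers stand in for real CT-ViT encoder output.

```python
import numpy as np


def avg_pool_3d(tokens, window):
    """Non-overlapping 3D average pooling over a (T, H, W, C) token grid."""
    t, h, w, c = tokens.shape
    wt, wh, ww = window
    if t % wt or h % wh or w % ww:
        raise ValueError("grid dimensions must be divisible by the window")
    # Split every spatial axis into (blocks, within-block) and average
    # over the three within-block axes.
    grid = tokens.reshape(t // wt, wt, h // wh, wh, w // ww, ww, c)
    return grid.mean(axis=(1, 3, 5))


def project_to_llm(tokens, weight, bias):
    """Flatten the pooled grid into a sequence and apply a linear projection."""
    seq = tokens.reshape(-1, tokens.shape[-1])  # (num_tokens, C)
    return seq @ weight + bias                  # (num_tokens, D_llm)


rng = np.random.default_rng(0)

# Hypothetical encoder output: 8 x 4 x 4 tokens with 16 channels each.
encoder_tokens = rng.standard_normal((8, 4, 4, 16))

# Pool 2 x 2 x 2 neighbourhoods, cutting the token count by a factor of 8.
pooled = avg_pool_3d(encoder_tokens, (2, 2, 2))

# Hypothetical projection from 16 visual channels to a 32-dim language space.
weight = rng.standard_normal((16, 32)) * 0.02
bias = np.zeros(32)
visual_embeds = project_to_llm(pooled, weight, bias)

print(pooled.shape, visual_embeds.shape)
```

In this sketch the pooling reduces 128 tokens to 16 before projection, which is the kind of sequence shortening that makes a 3D volume tractable as language-model input; the actual window and dimensions used by 3D-CT-GPT are not given in this section.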

Abstract

Medical image analysis is crucial in modern radiological diagnostics, especially given the exponential growth in medical imaging data. The demand for automated report generation systems has become increasingly urgent. While prior research has mainly focused on using machine learning and multimodal language models for 2D medical images, the generation of reports for 3D medical images has been less explored due to data scarcity and computational complexities. This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model specifically designed for generating radiology reports from 3D CT scans, particularly chest CTs. Extensive experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality. Although current methods are few, including the partially open-source CT2Rep and the open-source M3D, we ensured fair comparison through appropriate data conversion and evaluation methodologies. Experimental results indicate that 3D-CT-GPT enhances diagnostic accuracy and report coherence, establishing itself as a robust solution for clinical radiology report generation. Future work will focus on expanding the dataset and further optimizing the model to enhance its performance and applicability.

Paper Structure (31 sections, 6 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Comparison between (a) existing models such as RadMD and M3D-LaMed and (b) our 3D-CT-GPT, which combines CT-ViT, 3D average pooling, and a projection layer (dashed box) to improve report generation from 3D CT scans.
  • Figure 2: Overview of the 3D-CT-GPT model architecture, featuring three key components: (a) 3D CT Image Encoder utilizing CT-ViT for feature extraction; (b) Linear Projection Layer for feature transformation; (c) Integration of Vision and Language Models for generating contextually relevant radiology reports.
  • Figure 3: Qualitative comparison of report generation between 3D-CT-GPT and the ground truth. The generated reports from three different training strategies (T1, T2, T3) are compared against the true report. Text highlighted in the same color indicates similar content between the generated reports and the ground truth.
  • Figure 4: Effect of Temperature on T2 Model Performance across different metrics (BLEU, ROUGE, METEOR).
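
For readers unfamiliar with the temperature parameter studied in Figure 4, the sketch below shows the standard way temperature rescales next-token logits before sampling. This section does not say how 3D-CT-GPT applies temperature, so the function name `softmax_with_temperature` and the example logits are illustrative assumptions only.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter

print(sharp, flat)
```

Lower temperatures concentrate probability on the highest-scoring token, giving more deterministic reports, while higher temperatures spread probability across alternatives. That trade-off is one plausible reason overlap metrics such as BLEU, ROUGE, and METEOR would vary with temperature.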