CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering
Yuren Mao, Wenyi Xu, Yuyang Qin, Yunjun Gao
TL;DR
This work introduces CT-Agent, a multimodal LLM-based agent for 3D CT radiology question answering and report generation. It tackles anatomical complexity and cross-slice reasoning via anatomy-specific LoRA plugins and a global-local token compression strategy, enabling region-aware, token-efficient reasoning. Experiments on CT-RATE and RadGenome-ChestCT show superior performance in both radiology report generation and region-guided QA, with ablations validating planning, token compression, and exemplar retrieval components. The framework advances practical CT-based VQA and reporting by delivering interpretable, region-focused inferences with scalable memory-enabled grounding.
Abstract
A computed tomography (CT) scan produces 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices) and provides detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists' questions about specific anatomical regions on a CT scan, and even automatically generate a radiology report, is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for two reasons: (1) anatomical complexity makes CT images difficult to understand; (2) spatial relationships across hundreds of slices are difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomical complexity; furthermore, it efficiently captures cross-slice spatial relationships with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.
