CT-Agent: A Multimodal-LLM Agent for 3D CT Radiology Question Answering

Yuren Mao, Wenyi Xu, Yuyang Qin, Yunjun Gao

TL;DR

This work introduces CT-Agent, a multimodal LLM-based agent for 3D CT radiology question answering and report generation. It tackles anatomical complexity and cross-slice reasoning via anatomy-specific LoRA plugins and a global-local token compression strategy, enabling region-aware, token-efficient reasoning. Experiments on CT-RATE and RadGenome-ChestCT show superior performance in both radiology report generation and region-guided QA, with ablations validating planning, token compression, and exemplar retrieval components. The framework advances practical CT-based VQA and reporting by delivering interpretable, region-focused inferences with scalable memory-enabled grounding.

Abstract

A Computed Tomography (CT) scan produces 3D volumetric medical data that can be viewed as hundreds of cross-sectional images (a.k.a. slices) and provides detailed anatomical information for diagnosis. For radiologists, creating CT radiology reports is time-consuming and error-prone. A visual question answering (VQA) system that can answer radiologists' questions about specific anatomical regions in a CT scan, and even generate a radiology report automatically, is urgently needed. However, existing VQA systems cannot adequately handle the CT radiology question answering (CTQA) task for two reasons: (1) anatomical complexity makes CT images difficult to understand; (2) spatial relationships across hundreds of slices are difficult to capture. To address these issues, this paper proposes CT-Agent, a multimodal agentic framework for CTQA. CT-Agent adopts anatomically independent tools to break down the anatomical complexity; furthermore, it efficiently captures cross-slice spatial relationships with a global-local token compression strategy. Experimental results on two 3D chest CT datasets, CT-RATE and RadGenome-ChestCT, verify the superior performance of CT-Agent.

Paper Structure

This paper contains 32 sections, 15 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overall architecture of CT-Agent. It consists of three modules: a planning module, an action space, and a memory module. The planning module is driven by an LLM and is responsible for identifying the task type (Report Generation or Anatomy-level QA), parsing user inputs, locating the involved anatomical region, dispatching the appropriate tool modules, and planning how to select few-shot exemplars. The action space includes a set of anatomy-aware plugins, each specialized for a different anatomical region, together with a few-shot selection tool and a query normalization tool, which handle query selection and query rewriting. The memory module stores historical queries, planning paths, and prior radiology reports or question-answering results.
  • Figure 2: The pipeline of anatomy-aware reasoning tools. Given a set of axial CT slices, a frozen vision encoder extracts slice-level visual tokens. The token representations are processed through two parallel pathways: (1) Global Token Aggregation (GTA) and (2) Local Token Selection (LTS). The fused tokens are projected through a linear layer and combined with text tokens from the query to form a multimodal input sequence. The final response is generated by a pretrained language model augmented with LoRA plugins, enabling anatomy-specific reasoning and clinically accurate output. During training, the vision encoder and text encoder remain frozen, while the MoE module, Projector, and LoRA adapters are optimized.
  • Figure 3: Semantic retrieval pipeline for few-shot prompting. The current case’s anatomy-level outputs are encoded into a semantic vector and matched against a vector index constructed from historical reports. Top-matching exemplars are retrieved and prepended to the input prompt for final report generation.
  • Figure 4: Sentence-level comparison among the report generated by CT-Agent, the report generated by the baseline, and the reference report (ground truth). Green highlights indicate findings that are consistent between a generated report and the reference report. Blue highlights represent mismatches or deviations: in the ground truth, they mark clinically important details that the model failed to include; in a generated report, they mark content that does not exist in the reference report, which may suggest redundancy or hallucination. Red highlights indicate statements in a generated report that are factually incorrect or contradict the ground truth. A yellow background indicates internally contradictory content within the same generated report.
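
To make the two pathways in the Figure 2 caption more concrete, here is a minimal sketch of a global-local token compression step. It is not the paper's implementation: the caption does not specify how Global Token Aggregation or Local Token Selection are computed, so this sketch assumes mean pooling per slice for the global pathway and top-k selection by dot-product relevance to a query vector for the local pathway. The function names, the `query` vector, and the plain concatenation used as "fusion" are all hypothetical, and the projector and LoRA-augmented language model are omitted.

```python
from typing import List

Token = List[float]  # one visual token as a plain list of floats


def global_token_aggregation(slices: List[List[Token]]) -> List[Token]:
    """Assumed global pathway: mean-pool each slice into one summary token."""
    pooled = []
    for tokens in slices:
        dim = len(tokens[0])
        pooled.append([sum(t[d] for t in tokens) / len(tokens) for d in range(dim)])
    return pooled


def local_token_selection(slices: List[List[Token]], query: Token, k: int) -> List[Token]:
    """Assumed local pathway: keep the k tokens most relevant to the query."""
    scored = []
    for s_idx, tokens in enumerate(slices):
        for t_idx, tok in enumerate(tokens):
            score = sum(a * b for a, b in zip(tok, query))  # dot-product relevance
            scored.append((score, s_idx, t_idx, tok))
    # Highest score first; ties are broken by slice and token position.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    top = scored[:k]
    # Restore the original slice/token order among the kept tokens.
    top.sort(key=lambda item: (item[1], item[2]))
    return [tok for _, _, _, tok in top]


def compress(slices: List[List[Token]], query: Token, k: int) -> List[Token]:
    """Fuse both pathways by simple concatenation (an assumption of this sketch)."""
    return global_token_aggregation(slices) + local_token_selection(slices, query, k)
```

With S slices of T tokens each, this sketch passes S + k tokens onward instead of S × T, which illustrates the kind of token saving such a compression step is after.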
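
The retrieval step in the Figure 3 caption can be illustrated in a similar way. The following is a rough sketch under stated assumptions, not the authors' pipeline: the caption does not name the encoder, the index, or the similarity measure, so this example takes already-encoded vectors as input, stores the index as an in-memory list, and ranks by cosine similarity. `retrieve_exemplars`, `build_prompt`, and the prompt wording are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_exemplars(
    query_vec: Sequence[float],
    index: List[Tuple[Sequence[float], str]],
    top_k: int,
) -> List[str]:
    """Return the top_k historical reports most similar to the query vector."""
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
    return [report for _, report in ranked[:top_k]]


def build_prompt(exemplars: List[str], findings: str) -> str:
    """Prepend the retrieved exemplars to the current case's findings."""
    parts = [f"Example report {i}:\n{r}" for i, r in enumerate(exemplars, 1)]
    parts.append(f"Anatomy-level findings:\n{findings}\nReport:")
    return "\n\n".join(parts)


# Toy index of (vector, report) pairs; in practice an encoder would produce the vectors.
index = [([1.0, 0.0], "A"), ([0.0, 1.0], "B"), ([0.7, 0.7], "C")]
exemplars = retrieve_exemplars([1.0, 0.1], index, top_k=2)
prompt = build_prompt(exemplars, "Lungs: no nodules.")
```

For a large collection of historical reports, the linear scan here would normally be replaced by an approximate nearest-neighbor index; the ranking and prepending logic would stay the same.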