CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers

Yannian Gu, Xizhuo Zhang, Linjie Mu, Yongrui Yu, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang

TL;DR

CT-Flow is an agentic framework for interoperable volumetric interpretation that provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have shown strong potential for multi-modal radiological reasoning, particularly in tasks like diagnostic visual question answering (VQA) and radiology report generation. However, most existing approaches for 3D CT analysis rely on static, single-pass inference. In practice, clinical interpretation is a dynamic, tool-mediated workflow in which radiologists iteratively review slices and use measurement, radiomics, and segmentation tools to refine findings. To bridge this gap, we propose CT-Flow, an agentic framework designed for interoperable volumetric interpretation. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. We curate CT-FlowBench, the first large-scale instruction-tuning benchmark tailored for 3D CT tool-use and multi-step reasoning. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences. Experimental evaluations on CT-FlowBench and standard 3D VQA datasets demonstrate that CT-Flow achieves state-of-the-art performance, surpassing baseline models by 41% in diagnostic accuracy and achieving a 95% success rate in autonomous tool invocation. This work provides a scalable foundation for integrating autonomous, agentic intelligence into real-world clinical radiology.

Paper Structure

This paper contains 44 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of 3D CT analysis paradigms. Left: Traditional End-to-End LVLMs rely on passive visual ingestion of 3D data, resulting in static textual outputs. Right: The proposed CT-Flow framework leverages the Model Context Protocol to transform the LLM into an active agent. It dynamically orchestrates specialized tools to deliver precise, multi-modal diagnosis.
  • Figure 2: Overview of the CT-Flow framework. (i) Data Construction: The pipeline for raw data curation, trajectory synthesis, and the establishment of the CT-Flow benchmark. (ii) Architectures: The system decouples the LLM orchestrator from the imaging environment via FASTMCP, bridging high-level servers with medical imaging infrastructures to provide a suite of atomic tools in the Tool Space. (iii) Case Study: A demonstration of a Language-Action Trajectory $\mathcal{T}$. The orchestrator performs Active Probing by iteratively generating reasoning states ($s_t$), executing tool calls ($a_t$), and interpreting high-fidelity observations ($o_t$) to reach a grounded diagnostic answer.
  • Figure 3: Comparative performance of various models using the CT-Flow framework vs. the slice-based baseline.
  • Figure 4: Performance impact of tool category ablation. Bars indicate accuracy (%) and the red line tracks format errors. Removing specific tool classes (cls. 2-4) leads to decreased diagnostic accuracy and increased errors across all tasks, validating the necessity of the full hierarchical toolset.
  • Figure 5: Structure of the System Prompt. The core principles, critical thinking rules, and standard operating procedures (SOPs) for the AI Medical Imaging Assistant are displayed. To facilitate a clear presentation, the specific technical definitions of available tools have been truncated.
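
The Language-Action Trajectory described in the Figure 2 caption (reasoning state $s_t$, tool call $a_t$, observation $o_t$) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name below (`Step`, `run_trajectory`, `scripted_policy`, the two toy tools) is hypothetical, the tools are plain functions standing in for MCP servers, the "volume" is a tiny nested list rather than real CT data, and a scripted policy stands in for the LLM orchestrator.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

# Toy stand-in for a CT volume: volume[z][y][x] holds an intensity value (HU).
Volume = List[List[List[float]]]
Action = Tuple[str, Dict[str, Any]]  # (tool name, keyword arguments)


@dataclass
class Step:
    state: str                 # s_t: the orchestrator's reasoning text
    action: Optional[Action]   # a_t: the tool call, or None for the final answer
    observation: Any           # o_t: what the tool returned


# Hypothetical atomic tools; in the paper these would live behind MCP servers.
def get_shape(volume: Volume) -> Tuple[int, int, int]:
    return (len(volume), len(volume[0]), len(volume[0][0]))


def slice_mean(volume: Volume, z: int) -> float:
    values = [v for row in volume[z] for v in row]
    return sum(values) / len(values)


TOOLS: Dict[str, Callable[..., Any]] = {
    "get_shape": get_shape,
    "slice_mean": slice_mean,
}

Policy = Callable[[List[Step]], Tuple[str, Optional[Action]]]


def run_trajectory(policy: Policy, tools: Dict[str, Callable[..., Any]],
                   volume: Volume, max_steps: int = 8) -> List[Step]:
    """Alternate reasoning, tool execution, and observation until the policy answers."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        state, action = policy(trajectory)
        if action is None:
            # The policy chose to answer instead of calling another tool.
            trajectory.append(Step(state, None, None))
            break
        name, kwargs = action
        if name in tools:
            observation = tools[name](volume, **kwargs)
        else:
            # Report unknown tools back to the policy as an observation.
            observation = f"error: unknown tool {name!r}"
        trajectory.append(Step(state, action, observation))
    return trajectory


def scripted_policy(trajectory: List[Step]) -> Tuple[str, Optional[Action]]:
    """Fixed three-step script that stands in for the LLM orchestrator."""
    if len(trajectory) == 0:
        return "Need the volume dimensions first.", ("get_shape", {})
    if len(trajectory) == 1:
        depth = trajectory[0].observation[0]
        return "Inspect the middle slice.", ("slice_mean", {"z": depth // 2})
    mean_hu = trajectory[1].observation
    return f"Middle-slice mean is {mean_hu:.1f} HU; answering.", None


# Three 2x2 slices with constant intensities 0, 40, and -100.
VOLUME: Volume = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[40.0, 40.0], [40.0, 40.0]],
    [[-100.0, -100.0], [-100.0, -100.0]],
]

trajectory = run_trajectory(scripted_policy, TOOLS, VOLUME)
for t, step in enumerate(trajectory):
    print(t, step.state, step.action, step.observation)
```

Keeping the policy separate from the tool registry mirrors the decoupling of orchestrator and imaging environment that the caption describes: under these assumptions, tools can be added or removed without touching the loop, which is also the kind of change the tool-category ablation in Figure 4 would require.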