Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges

Christian Bluethgen, Dave Van Veen, Daniel Truhn, Jakob Nikolas Kather, Michael Moor, Malgorzata Polacin, Akshay Chaudhari, Thomas Frauenfelder, Curtis P. Langlotz, Michael Krauthammer, Farhad Nooralahzadeh

TL;DR

This paper argues that radiology stands to gain from LLM-based agentic systems that can reason, plan, and act across multi-step tasks by integrating external tools and memory. It surveys the technical foundations (agents, tools, grounding, memory, and design patterns), frames radiology as a complex environment with rich knowledge sources and interoperable infrastructure, and presents concrete applications ranging from report drafting to multidisciplinary team (MDT) discussions. A multi-tier evaluation framework (planning, execution, outcome, system-level) is proposed to capture complex, open-ended performance, complemented by benchmarks and guidelines to advance safe, effective deployment. The authors highlight significant challenges (LLM limits, cascading errors, multi-agent coordination, and governance) and argue that careful design, robust evaluation, and human-AI collaboration are essential for translating agentic radiology from prototype to clinical impact.

Abstract

Building agents, systems that perceive and act upon their environment with a degree of autonomy, has long been a focus of AI research. This pursuit has recently become vastly more practical with the emergence of large language models (LLMs) capable of using natural language to integrate information, follow instructions, and perform forms of "reasoning" and planning across a wide range of tasks. With its multimodal data streams and orchestrated workflows spanning multiple systems, radiology is uniquely suited to benefit from agents that can adapt to context and automate repetitive yet complex tasks. In radiology, LLMs and their multimodal variants have already demonstrated promising performance for individual tasks such as information extraction and report summarization. However, using LLMs in isolation underutilizes their potential to support complex, multi-step workflows where decisions depend on evolving context from multiple information sources. Equipping LLMs with external tools and feedback mechanisms enables them to drive systems that exhibit a spectrum of autonomy, ranging from semi-automated workflows to more adaptive agents capable of managing complex processes. This review examines the design of such LLM-driven agentic systems, highlights key applications, discusses evaluation methods for planning and tool use, and outlines challenges such as error cascades, tool-use efficiency, and health IT integration.

Paper Structure

This paper contains 29 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Conceptual architecture of a radiology-focused LLM-based agent. An initial input (1) provides the task and context. The agent then enters a cycle of obtaining observations (2), reasoning and planning over context, and performing actions (3) on the environment, such as tool calls or database queries. This cycle continues until a final output (4) is produced. The agent comprises an LLM, a framework, and a working memory. An optional agent-owned long-term memory stores episodic (past interactions) and semantic (factual knowledge) information to support retrieval and learning. The agent interacts with its environment (green box), including external systems (e.g., HIS/RIS/PACS, EHR, databases), general and radiology-specific tools, humans, and other AI agents, via defined interfaces (e.g., Model Context Protocol (MCP), Agent-to-Agent (A2A)) and safeguards (e.g., PHI redaction, input validation). AI: Artificial Intelligence. EHR: Electronic Health Record. HIS: Hospital Information System. LLM: Large Language Model. PACS: Picture Archiving and Communication System. PHI: Protected Health Information. RIS: Radiology Information System.
  • Figure 2: Overview of building blocks and design patterns for LLM-based agentic systems. The illustrated components are modular rather than mutually exclusive and can be combined to arbitrary complexity. (Left column) A single LLM call forms the basic building block, optionally reading from or writing to external tools or memory. (Center column) Multiple LLM calls can form workflows through (i) chaining in fixed sequences, (ii) routing and/or parallelization by an orchestrator LLM, or (iii) aggregation and refinement by an evaluator that synthesizes or rejects results. (Right column) Agent systems extend this pattern: a single agent interacts with its environment through observation–reason–action loops, while multi-agent systems organize agents hierarchically (e.g., manager–subagent), collaboratively (specialized roles), or sequentially. LLM: large language model.
  • Figure 3: Example of an agentic workflow for report drafting within traditional radiology infrastructure. The workflow begins with a radiologist selecting a study (Step 1), triggering a FHIRcast event that notifies the agent. The agent plans the reporting task by retrieving prior studies from PACS through a DICOM MCP server (Step 2). It then calls a chest X-ray foundation model (CXR FM) via a custom MCP server to analyze the current and prior images. Once the model returns findings, the agent retrieves a structured report template from a template database via a database MCP server and populates it with model output. The structured draft report is sent to the radiology information system (RIS) using the FHIR protocol via a FHIR MCP server. The workflow concludes when the radiologist receives and reviews the draft report (Step 3). The agent continuously observes, reasons, and takes actions throughout the process via Model Context Protocol (MCP) interactions. DICOM: Digital Imaging and Communications in Medicine. CXR: Chest X-Ray. API: Application Programming Interface. FHIR: Fast Healthcare Interoperability Resources. FM: Foundation Model. MCP: Model Context Protocol. PACS: Picture Archiving and Communication System. RIS: Radiology Information System.
  • Figure 4: High-level Evaluation Framework for Agentic Workflows. This framework decomposes agent behavior into four tiers: Planning, Execution, Outcome, and System-level evaluation. Planning assesses task identification and strategy formulation; Execution evaluates reasoning and decision-making, tool use, and memory management at each step of iterative cycles; and Outcome measures overall task success and termination quality. The system-level tier (e.g., costs, long-term effects) is omitted from the figure for clarity.
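
The observation, reasoning, and action cycle described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool functions, the `Agent` class, and `scripted_policy` (a rule-based stand-in for an LLM call) are hypothetical names introduced here for illustration only, and the returned strings are placeholder data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tools standing in for the environment (e.g., PACS or
# template-database queries). They return placeholder strings only.
def fetch_prior_studies(patient_id: str) -> str:
    return f"2 prior chest X-rays found for {patient_id}"

def fetch_report_template(modality: str) -> str:
    return f"structured template for {modality}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "fetch_prior_studies": fetch_prior_studies,
    "fetch_report_template": fetch_report_template,
}

# A policy maps (task, working memory) to (action name, argument).
Policy = Callable[[str, List[str]], Tuple[str, str]]

@dataclass
class Agent:
    policy: Policy                                   # stand-in for the LLM
    memory: List[str] = field(default_factory=list)  # working memory
    max_steps: int = 5                               # guard against endless loops

    def run(self, task: str) -> str:
        self.memory.append(f"task: {task}")          # (1) initial input
        for _ in range(self.max_steps):
            action, arg = self.policy(task, self.memory)  # reason and plan
            if action == "final":
                return arg                           # (4) final output
            observation = TOOLS[action](arg)         # (3) act on the environment
            # (2) record the observation so the next step can use it
            self.memory.append(f"{action}({arg}) -> {observation}")
        return "stopped: step budget exhausted"

def scripted_policy(task: str, memory: List[str]) -> Tuple[str, str]:
    """Rule-based placeholder for an LLM: choose the next step from memory."""
    if not any(m.startswith("fetch_prior_studies") for m in memory):
        return "fetch_prior_studies", "patient-001"
    if not any(m.startswith("fetch_report_template") for m in memory):
        return "fetch_report_template", "CXR"
    return "final", f"draft ready after {len(memory) - 1} tool calls"

if __name__ == "__main__":
    agent = Agent(policy=scripted_policy)
    print(agent.run("draft chest X-ray report"))
    for entry in agent.memory:
        print(entry)
```

The step budget (`max_steps`) is one simple design choice for bounding the loop; in a real system the policy would be an LLM call and the tools would sit behind the interfaces and safeguards named in the caption.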