How Well Can Modern LLMs Act as Agent Cores in Radiology Environments?

Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR

RadA-BenchPlat introduces a large-scale radiology agent benchmarking platform that uses 2,200 synthetic patient records across six anatomical regions, five imaging modalities, and 2,200 disease scenarios to generate 24,200 QA pairs. It benchmarks seven leading LLMs as agent cores across 11 radiology tasks and ten tool categories, revealing that current models struggle with complex task understanding and tool coordination, even though Claude-series models show relative strength. The study shows that adaptive prompting strategies (prompt back-propagation, self-reflection, few-shot learning, multi-agent collaboration) can substantially boost task completion, up to 48.2% for complex tasks, and automated tool-building can recover up to 65.4% of previously unsolvable tasks. The findings highlight practical pathways toward integrated, automated radiology workflows while underscoring the need for clinical oversight and further model/tool development.
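
The headline counts above are consistent with one QA pair per patient record per radiology task (2,200 records × 11 tasks = 24,200). The summary does not state that this is how the pairs were generated, so the one-pair-per-record-per-task rule in the sketch below is an assumption, and the function name is hypothetical.

```python
# Sanity check of the reported benchmark size.
# ASSUMPTION: one QA pair per (patient record, task type); the source
# reports the totals but does not spell out this generation rule.
NUM_RECORDS = 2_200          # synthetic patient records (from the summary)
NUM_TASKS = 11               # radiology task types (from the summary)
REPORTED_QA_PAIRS = 24_200   # QA pairs (from the summary)


def expected_qa_pairs(records: int, tasks: int, pairs_per_record_task: int = 1) -> int:
    """Return the QA-pair count implied by the assumed generation rule."""
    return records * tasks * pairs_per_record_task


total = expected_qa_pairs(NUM_RECORDS, NUM_TASKS)
print(f"implied QA pairs: {total}, matches reported: {total == REPORTED_QA_PAIRS}")
```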

Abstract

We introduce RadA-BenchPlat, an evaluation platform that benchmarks the performance of large language models (LLMs) acting as agent cores in radiology environments. It uses 2,200 radiologist-verified synthetic patient records covering six anatomical regions, five imaging modalities, and 2,200 disease scenarios, resulting in 24,200 question-answer pairs that simulate diverse clinical situations. The platform also defines ten categories of tools for agent-driven task solving and evaluates seven leading LLMs. The evaluation reveals that while models like Claude-3.7-Sonnet can achieve a 67.1% task completion rate in routine settings, they still struggle with complex task understanding and tool coordination, limiting their capacity to serve as the central core of automated radiology systems. By incorporating four advanced prompt engineering strategies, of which prompt-backpropagation and multi-agent collaboration contributed 16.8% and 30.7% improvements, respectively, the performance on complex tasks was enhanced by 48.2% overall. Furthermore, automated tool building was explored to improve robustness, achieving a 65.4% success rate, thereby offering promising insights for the future integration of fully automated radiology applications into clinical practice. All of our code and data are openly available at https://github.com/MAGIC-AI4Med/RadABench.

Paper Structure

This paper contains 13 sections, 12 figures.

Figures (12)

  • Figure 1: Benchmark overview. a. Overview of the proposed RadABench. The left panel highlights our primary concern and summarizes the observed results. The right panel details the key components integrated into the benchmarking platform. b. Main radiology task completion results evaluated across LLM comparisons and enhanced methods, revealing that the Claude-series models perform best, albeit at a moderate level overall, and that enhanced prompting methods yield significant improvements.
  • Figure 1: A detailed explanation of 20 important terminologies used in this paper. a. Basic terms commonly used throughout the paper. b. Sophisticated definitions of tool-related concepts. c. Agent-related terms including five key abilities. d. Four radiology environments featuring various tool combinations for different evaluation purposes.
  • Figure 2: Statistics in RadA-BenchPlat. a. Distribution of patient sex across different anatomical regions. b. Distribution of patient records across six anatomical regions categorized by age, height, and weight ranges. c. Ten categories of tools and eleven types of radiology tasks, with checkmarks indicating tools required for task completion. Tool sets are simulated through various tool attribute combinations. d. 24,200 generated QA-pairs based on radiology task types using GPT-4. e. Number of tools across eight simulated tool set conditions. f. Maximum token length of the open-source and closed-source LLMs used in this study.
  • Figure 2: Features distribution on anomaly/disease/biomarker/indicator extracted by BioLORD and MedCPT.
  • Figure 3: Main results of task completion performance. a. Three-level task complexity examples categorized as easy, medium, and complex. b. Main results from state-of-the-art LLMs using various prompt engineering strategies, with the Integration approach (Sonnet-3.7 combined with four prompt improvement methods) demonstrating superior performance compared to other approaches. c. Statistical significance of performance improvements shown through p-values and t-values between involved prompt engineering strategies.
  • ...and 7 more figures