How Well Can Modern LLMs Act as Agent Cores in Radiology Environments?
Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie
TL;DR
RadA-BenchPlat is a large-scale benchmarking platform for radiology agents. It uses 2,200 synthetic patient records spanning six anatomical regions, five imaging modalities, and 2,200 disease scenarios to generate 24,200 QA pairs. It benchmarks seven leading LLMs as agent cores across 11 radiology tasks and ten tool categories, and finds that current models struggle with complex task understanding and tool coordination, although Claude-series models are comparatively strong. Adaptive prompting strategies (prompt-backpropagation, self-reflection, few-shot learning, multi-agent collaboration) improve performance on complex tasks by up to 48.2%, and automated tool building can recover up to 65.4% of previously unsolvable tasks. The findings point to practical pathways toward integrated, automated radiology workflows while underscoring the need for clinical oversight and further model and tool development.
Abstract
We introduce RadA-BenchPlat, an evaluation platform that benchmarks large language models (LLMs) acting as agent cores in radiology environments. It uses 2,200 radiologist-verified synthetic patient records covering six anatomical regions, five imaging modalities, and 2,200 disease scenarios, yielding 24,200 question-answer pairs that simulate diverse clinical situations. The platform also defines ten categories of tools for agent-driven task solving and evaluates seven leading LLMs. Although models such as Claude-3.7-Sonnet achieve a 67.1% task completion rate in routine settings, they still struggle with complex task understanding and tool coordination, which limits their capacity to serve as the central core of automated radiology systems. Incorporating four advanced prompt engineering strategies improved performance on complex tasks by 48.2% overall, with prompt-backpropagation and multi-agent collaboration contributing improvements of 16.8% and 30.7%, respectively. We further explored automated tool building to improve robustness, which achieved a 65.4% success rate and offers promising insights for the future integration of fully automated radiology applications into clinical practice. All of our code and data are openly available at https://github.com/MAGIC-AI4Med/RadABench.
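
The dataset figures quoted above are mutually consistent if each of the 2,200 patient records yields one question-answer pair for each of the 11 radiology tasks. That pairing is an inference from the arithmetic, not something the text states, and the names below are hypothetical. A minimal Python sketch of the check:

```python
# Counts quoted in the TL;DR and abstract.
NUM_PATIENT_RECORDS = 2_200
NUM_TASKS = 11
REPORTED_QA_PAIRS = 24_200


def expected_qa_pairs(num_records: int, num_tasks: int) -> int:
    """Return the QA-pair count under the ASSUMPTION of one pair per (record, task).

    The source does not state how pairs are generated; this is only the
    simplest reading that reproduces the reported total.
    """
    return num_records * num_tasks


if __name__ == "__main__":
    computed = expected_qa_pairs(NUM_PATIENT_RECORDS, NUM_TASKS)
    # Compare the computed total against the figure reported in the text.
    print(f"computed={computed}, reported={REPORTED_QA_PAIRS}, "
          f"match={computed == REPORTED_QA_PAIRS}")
```

If the benchmark instead produced a variable number of questions per record, the total would not factor this cleanly, so the match is suggestive rather than conclusive.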
