M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging
Jinghao Feng, Qiaoyu Zheng, Chaoyi Wu, Ziheng Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
TL;DR
Automating ML pipelines for medical imaging is hampered by a scarcity of ready-made tools and by domain heterogeneity. M3Builder introduces a framework of four specialized LLM agents that operate within a medical imaging ML workspace to autonomously construct end-to-end ML pipelines, from data processing and environment configuration to debugging and model training. The authors propose M3Bench, a benchmark of four general tasks on 14 training datasets spanning five anatomies and three imaging modalities, and evaluate seven LLMs as agent cores, with Claude-3.7-Sonnet achieving a 94.29% success rate. Compared with existing agentic systems, M3Builder shows higher effectiveness and efficiency, which the authors attribute to its structured workspace design and cross-agent collaboration, pointing toward fully automated medical imaging AI tools. The work lays groundwork for extending to broader medical tasks and for integrating enhanced tool-building and visual processing capabilities.
Abstract
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well-prepared tools limits their applicability in the medical domain, which requires training specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi-agent system designed to automate machine learning (ML) in medical imaging. At its core, M3Builder employs four specialized agents that collaborate to tackle complex, multi-step medical ML workflows, from automated data processing and environment configuration to self-contained auto-debugging and model training. These agents operate within a medical imaging ML workspace, a structured environment that provides agents with free-text descriptions of datasets, training code, and interaction tools, enabling seamless communication and task execution. (ii) To evaluate progress in automated medical imaging ML, we propose M3Bench, a benchmark comprising four general tasks on 14 training datasets, across five anatomies and three imaging modalities, covering both 2D and 3D data. (iii) We experiment with seven state-of-the-art large language models serving as agent cores for our system, including the Claude series, GPT-4o, and DeepSeek-V3. Compared to existing ML agentic designs, M3Builder shows superior performance in completing ML tasks in medical imaging, achieving a 94.29% success rate with Claude-3.7-Sonnet as the agent core and demonstrating considerable potential for fully automated machine learning in medical imaging.
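For readers unfamiliar with the metric, a success rate of this kind is simply the percentage of runs that finish successfully. The sketch below is a minimal illustration, not the authors' evaluation code: the function name `success_rate` and the run tally are hypothetical, and the abstract does not report how many runs underlie the 94.29% figure. The tally of 66 successes in 70 runs is merely one made-up combination that rounds to the same value.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of runs marked successful."""
    if not outcomes:
        raise ValueError("need at least one run")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical tally (not taken from the source): 66 successes, 4 failures.
runs = [True] * 66 + [False] * 4
print(f"{success_rate(runs):.2f}%")
```

Because the result depends only on the ratio, the same percentage can arise from different run counts, so the figure alone does not reveal the size of the evaluation.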
