SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science
Wonduk Seo, Juhyeon Lee, Yanjun Shao, Qingshan Zhou, Seunghyun Lee, Yi Bu
TL;DR
SPIO tackles the rigidity of traditional AutoML by introducing sequential plan integration and optimization across four pipeline modules, enabling adaptive multi-path exploration. It combines fundamental code-generation agents with an LLM-driven sequential planner, delivering SPIO-S (single best plan) and SPIO-E (top-k ensemble) variants. Across Kaggle/OpenML benchmarks and multiple LLM backends, SPIO yields an average improvement of 5.6% over strong baselines, with ablations showing that feature engineering and hyperparameter tuning are key performance drivers. The work advances automated data science by providing a transparent, robust framework that balances exploration, fidelity, and efficiency.
Abstract
Large Language Models (LLMs) have enabled dynamic reasoning in automated data analytics, yet recent multi-agent systems remain limited by rigid, single-path workflows that restrict strategic exploration and often lead to suboptimal outcomes. To overcome these limitations, we propose SPIO (Sequential Plan Integration and Optimization), a framework that replaces rigid workflows with adaptive, multi-path planning across four core modules: data preprocessing, feature engineering, model selection, and hyperparameter tuning. In each module, specialized agents generate diverse candidate strategies, which are cascaded and refined by an optimization agent. SPIO offers two operating modes: SPIO-S for selecting a single optimal pipeline, and SPIO-E for ensembling top-k pipelines to maximize robustness. Extensive evaluations on Kaggle and OpenML benchmarks show that SPIO consistently outperforms state-of-the-art baselines, achieving an average performance gain of 5.6%. By explicitly exploring and integrating multiple solution paths, SPIO delivers a more flexible, accurate, and reliable foundation for automated data science.
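The difference between the two operating modes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CandidatePlan` class, the function names, the use of a validation score to rank plans, and the simple averaging of predictions are all assumptions made for illustration. In SPIO the choice is made by an LLM-based optimization agent, and the abstract does not specify how SPIO-E combines its pipelines.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CandidatePlan:
    """Hypothetical container for one end-to-end pipeline candidate."""
    name: str
    val_score: float          # assumed: higher is better
    predictions: list[float]  # assumed: predicted values on a held-out set


def select_single(plans: list[CandidatePlan]) -> CandidatePlan:
    # SPIO-S analogue: keep only the single highest-ranked plan.
    return max(plans, key=lambda p: p.val_score)


def ensemble_top_k(plans: list[CandidatePlan], k: int) -> list[float]:
    # SPIO-E analogue: take the k highest-ranked plans and average
    # their predictions element-wise (averaging is an assumption here).
    top = sorted(plans, key=lambda p: p.val_score, reverse=True)[:k]
    return [mean(column) for column in zip(*(p.predictions for p in top))]


# Made-up numbers, purely to exercise the two functions.
plans = [
    CandidatePlan("plan_a", 0.81, [0.9, 0.2, 0.6]),
    CandidatePlan("plan_b", 0.84, [0.8, 0.1, 0.7]),
    CandidatePlan("plan_c", 0.78, [0.3, 0.4, 0.5]),
]

best = select_single(plans)
blended = ensemble_top_k(plans, k=2)
```

Here `best` is the single plan a selective mode would keep, while `blended` mixes the two top-ranked plans, which is the kind of trade-off between a single optimal pipeline and added robustness that the abstract describes.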
