DatawiseAgent: A Notebook-Centric LLM Agent Framework for Adaptive and Robust Data Science Automation
Ziming You, Yumiao Zhang, Dexuan Xu, Yiwei Lou, Yandong Yan, Wei Wang, Huaming Zhang, Yu Huang
TL;DR
DatawiseAgent addresses the challenge of end-to-end data science automation by introducing a notebook-centric LLM agent that unifies agent-user-environment interaction into notebook cells and governs behavior with a non-deterministic finite-state transducer across four stages. The framework enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures via DFS-like planning, incremental execution, self-debugging, and post-filtering. Across three diverse data science tasks and multiple LLMs, it achieves state-of-the-art performance and demonstrates robustness to model capability and scale, while maintaining favorable cost-performance trade-offs. This approach strengthens practical deployment of autonomous data science agents in resource-constrained settings and aligns closely with standard notebook workflows used by data scientists.
Abstract
Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance, surpassing strong baselines such as AutoGen and TaskWeaver and demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring its robustness and scalability.
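To make the FST-based multi-stage idea concrete, the following is a minimal sketch of a non-deterministic finite-state transducer that moves between planning, execution, debugging, and filtering stages. It is an illustration under stated assumptions, not the DatawiseAgent implementation: the stage names only loosely mirror the mechanisms named in the summary, and the input signals (`start`, `ok`, `error`), the emitted action labels, the transition table, and the `choose` callback (standing in for the LLM's decision) are all hypothetical.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Sequence, Tuple


class Stage(Enum):
    PLAN = auto()
    EXECUTE = auto()
    DEBUG = auto()
    FILTER = auto()
    DONE = auto()


Option = Tuple[Stage, str]  # (next stage, emitted action label)

# Hypothetical transition relation: (stage, input signal) -> possible moves.
# Keys with more than one option are what make the transducer non-deterministic.
TRANSITIONS: Dict[Tuple[Stage, str], List[Option]] = {
    (Stage.PLAN, "start"): [(Stage.EXECUTE, "write_plan_cell")],
    (Stage.EXECUTE, "ok"): [
        (Stage.EXECUTE, "write_next_code_cell"),
        (Stage.PLAN, "plan_next_subtask"),
        (Stage.DONE, "finish"),
    ],
    (Stage.EXECUTE, "error"): [(Stage.DEBUG, "write_fix_cell")],
    (Stage.DEBUG, "ok"): [(Stage.FILTER, "drop_failed_cells")],
    (Stage.DEBUG, "error"): [
        (Stage.DEBUG, "write_fix_cell"),
        (Stage.FILTER, "drop_failed_cells"),
    ],
    (Stage.FILTER, "ok"): [
        (Stage.EXECUTE, "write_next_code_cell"),
        (Stage.PLAN, "plan_next_subtask"),
    ],
}


def run(
    signals: Sequence[str],
    choose: Callable[[List[Option]], Option],
) -> Tuple[Stage, List[str]]:
    """Consume input signals and return the final stage plus emitted actions."""
    stage = Stage.PLAN
    actions: List[str] = []
    for signal in signals:
        options = TRANSITIONS.get((stage, signal))
        if not options:
            raise ValueError(f"no transition from {stage.name} on {signal!r}")
        # In an agent, an LLM would pick among the options; here it is a callback.
        stage, action = choose(options)
        actions.append(action)
        if stage is Stage.DONE:
            break
    return stage, actions


if __name__ == "__main__":
    # Always take the first option: plan, hit an error, fix it, filter, resume.
    final_stage, emitted = run(
        ["start", "error", "ok", "ok"], choose=lambda opts: opts[0]
    )
    print(final_stage.name, emitted)
```

Representing the transitions as a relation rather than a function keeps the control flow constrained to a small set of legal moves while leaving the choice among them to the model, which is one plausible reading of how a non-deterministic transducer can govern agent behavior.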
