AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Zhuoyun Yu, Shuofei Qiao, Jintian Zhang, Da Zheng, Yuren Mao, Yunjun Gao, Huajun Chen, Ningyu Zhang

TL;DR

AutoMind addresses two core limitations of LLM-driven data science agents, rigid workflows and a lack of empirical human expertise, by grounding reasoning in a curated expert knowledge base, employing an agentic knowledgeable tree search, and applying a self-adaptive coding strategy. The framework outperforms state-of-the-art baselines on two automated data science benchmarks (MLE-Bench and Top AI Competitions) and is more efficient, with lower token costs. Ablation studies confirm the value of expert knowledge and adaptive coding for complex tasks, while case studies illustrate practical gains on real-world data science challenges. Overall, AutoMind represents a robust and efficient step toward fully automated, knowledge-driven data science, with broad applicability to future AI-enabled scientific discovery.

Abstract

Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science. Code is at https://github.com/innovatingAI/AutoMind.

Paper Structure

This paper contains 40 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The framework of our AutoMind.
  • Figure 2: Test time scaling results on MLE-Bench. We record hourly snapshots of the percentage of human participants surpassed by the agent's best solution over a 24-hour time budget in experiments with DeepSeek-V3.
  • Figure 3: Ablation studies on DeepSeek-V3 for AutoMind on the Medium split of MLE-Bench. Win Rate represents the percentage of human participants surpassed by the agent on the official leaderboard. Valid Rate represents the percentage of valid submissions among all solutions the agent makes within a 24-hour time budget.
  • Figure 4: A running case on the BELKA challenge. We compare the proposed solution plans and corresponding code implementations generated by both AIDE and AutoMind.
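
The captions for Figures 2 and 3 define the reported metrics in words. The sketch below shows one way those definitions could be computed. It is not the authors' or MLE-Bench's evaluation code: the function names, the `higher_is_better` flag, the strict comparison used for "surpassed", and the `(finish_time_hours, win_rate)` input format are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def win_rate(agent_score: float,
             human_scores: Sequence[float],
             higher_is_better: bool = True) -> float:
    """Percentage of human leaderboard entries strictly beaten by the agent.

    Assumption: "surpassed" means strictly better; ties do not count.
    """
    if not human_scores:
        raise ValueError("human_scores must not be empty")
    if higher_is_better:
        beaten = sum(1 for s in human_scores if agent_score > s)
    else:
        # For error-style metrics (e.g. RMSE) a lower score wins.
        beaten = sum(1 for s in human_scores if agent_score < s)
    return 100.0 * beaten / len(human_scores)


def valid_rate(is_valid: Sequence[bool]) -> float:
    """Percentage of valid submissions among all solutions produced."""
    if not is_valid:
        return 0.0
    return 100.0 * sum(is_valid) / len(is_valid)


def hourly_best(solutions: Sequence[Tuple[float, float]],
                budget_hours: int = 24) -> List[float]:
    """Best win rate reached by each hour mark (Figure 2 style snapshots).

    `solutions` holds hypothetical (finish_time_hours, win_rate) pairs.
    Hours with no finished solution yet are reported as 0.0.
    """
    snapshots = []
    for hour in range(1, budget_hours + 1):
        finished = [w for t, w in solutions if t <= hour]
        snapshots.append(max(finished, default=0.0))
    return snapshots
```

For example, `win_rate(0.8, [0.5, 0.7, 0.9, 0.95])` counts two of four human scores as beaten, and `hourly_best` yields a non-decreasing curve because it tracks the best solution found so far rather than the most recent one.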