A Survey on Large Language Model-based Agents for Statistics and Data Science

Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang

TL;DR

This survey examines how large language model-powered data agents lower barriers to statistical analysis by automating data tasks through natural language, planning, reasoning, and multi-agent collaboration within sandboxed environments. It details architectures, UI paradigms, and knowledge integration, and presents case studies in which agents perform exploratory data analysis (EDA), modeling, diagnostics, and uncertainty quantification. It also critiques current limitations (model capability, multi-modality, reproducibility, and real-world adoption) and outlines benchmarks and future directions toward more capable, extensible statistical software. Overall, the work maps a trajectory from prototype agents to integrated, user-friendly systems that can augment both domain experts and non-expert users in data-driven decision making.

Abstract

In recent years, data science agents powered by Large Language Models (LLMs), known as "data agents," have shown significant potential to transform the traditional data analysis paradigm. This survey provides an overview of the evolution, capabilities, and applications of LLM-based data agents, highlighting their role in simplifying complex data tasks and lowering the entry barrier for users without related expertise. We explore current trends in the design of LLM-based frameworks, detailing essential features such as planning, reasoning, reflection, multi-agent collaboration, user interface, knowledge integration, and system design, which enable agents to address data-centric problems with minimal human intervention. Furthermore, we analyze several case studies to demonstrate the practical applications of various data agents in real-world scenarios. Finally, we identify key challenges and propose future research directions to advance the development of data agents into intelligent statistical analysis software.

Paper Structure

This paper contains 29 sections, 14 figures, 2 tables.

Figures (14)

  • Figure 1: New paradigm of data analysis brought by generative AI.
  • Figure 2: Timeline of selected related works from 2023.
  • Figure 3: An architecture of an LLM-based data agent. The diagram illustrates the interaction between LLMs and a sandbox environment. On the left, key components of LLMs are highlighted, including User Interface, Planning, Reasoning, Reflection, and Error Handling. The sandbox, positioned centrally, serves as a controlled environment for executing task codes and generating results. On the right, various tools and software that can be pre-installed in the sandbox, such as Python, SQL, Jupyter, and R, indicate the diverse ecosystems where LLM-powered agents can operate.
  • Figure 4: Commonly used planning and reasoning strategies in LLM-based data agents for organizing tasks or solving problems. Each node represents a sub-task in the roadmap.
  • Figure 5: Partial dialogue from the ChatGPT-Advanced Data Analysis in Case Study 1. Items 1-4 list the work done by ChatGPT in each step.
  • ...and 9 more figures