LLM/Agent-as-Data-Analyst: A Survey

Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Dayou Zhou, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Xue Yang, Chunwei Liu, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu

TL;DR

The paper addresses how LLMs and agent techniques reshape data analysis across structured, semi-structured, unstructured, and heterogeneous data modalities. It presents a two-dimensional taxonomy mapping data modalities to interaction paradigms (code-based, DSL-based, NL-based) and identifies four design goals: semantic-aware design, autonomous pipelines, tool-augmented workflows, and open-world task support. The survey provides detailed method-level analyses, data-curation considerations, and industry practices, highlighting recent advances, challenges, and practical directions. Taken together, the survey offers guidance for researchers and practitioners aiming to build general-purpose LLM/Agent-powered data-analysis systems.

Abstract

Large language models (LLMs) and agent techniques have brought a fundamental shift in the functionality and development paradigm of data analysis tasks (a.k.a. LLM/Agent-as-Data-Analyst), demonstrating substantial impact across both academia and industry. In comparison with traditional rule-based or small-model-based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. From a modality perspective, we review LLM-based techniques for (i) structured data (e.g., NL2SQL, NL2GQL, ModelQA), (ii) semi-structured data (e.g., markup language understanding, semi-structured table question answering), (iii) unstructured data (e.g., chart understanding, text/image document understanding), and (iv) heterogeneous data (e.g., data retrieval and modality alignment in data lakes). From this technical evolution, we further distill four key design goals for intelligent data analysis agents, namely semantic-aware design, autonomous pipelines, tool-augmented workflows, and support for open-world tasks. Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent-powered data analysis.

Paper Structure

This paper contains 24 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 2: Technical Overview of LLM/Agent-as-Data-Analyst. The five key design goals are illustrated in the center of the figure using distinct colors. The colored icons next to each technique indicate the specific design goal it supports.
  • Figure 3: LLM for Structured Data Analysis - (a) Pipeline Method. (b) End-to-End Method.
  • Figure 4: LLM for Markup Data Analysis.
  • Figure 5: Example Characters of Semi-Structured Tables.
  • Figure 6: LLM for Semi-Structured Table Analysis.
  • ...and 3 more figures