A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang, Xuanhe Zhou, Guoliang Li, Yuyu Luo
TL;DR
This work addresses terminological ambiguity in data agents by proposing a six-level autonomy taxonomy (L0–L5) inspired by SAE J3016 and organizing a structured review of data-management, data-preparation, and data-analysis literature by autonomy level. It details how L1 introduces stateless preliminary assistance, how L2 adds perception of and interaction with data environments, and how proto-L3 efforts aim at autonomous orchestration under supervision, with L4–L5 envisioned as proactive and then fully autonomous, generative data agents. The key contributions are the taxonomy itself, a level-based literature synthesis, an analysis of evolutionary gaps (notably L2→L3), and a forward roadmap toward proactive and generative data agents. The paper aims to inform governance and benchmarking in the data-agent space, outlining practical steps toward scalable autonomy and reduced human intervention in data ecosystems. The findings suggest a staged trajectory from augmented assistance to autonomous data scientists, with significant emphasis on safe, long-horizon planning and cross-lifecycle capabilities for real-world deployment.
Abstract
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents, autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks. However, the term "data agent" currently suffers from terminological ambiguity and inconsistent adoption, conflating simple query responders with sophisticated autonomous architectures. This ambiguity fosters mismatched user expectations, accountability challenges, and barriers to industry growth. Inspired by the SAE J3016 standard for driving automation, this survey introduces the first systematic hierarchical taxonomy for data agents, comprising six levels that delineate and trace progressive shifts in autonomy, from manual operations (L0) to a vision of generative, fully autonomous data agents (L5), thereby clarifying capability boundaries and responsibility allocation. Through this lens, we offer a structured review of existing research arranged by increasing autonomy, encompassing specialized data agents for data management, preparation, and analysis, alongside emerging efforts toward versatile, comprehensive systems with enhanced autonomy. We further analyze critical evolutionary leaps and technical gaps for advancing data agents, especially the ongoing L2-to-L3 transition, where data agents evolve from procedural execution to autonomous orchestration. Finally, we conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
