Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents

Ruiying Ma, Shreya Shankar, Ruiqi Chen, Yiming Lin, Sepanta Zeighami, Rajoshi Ghosh, Abhinav Gupta, Anushrut Gupta, Tanmai Gopal, Aditya G. Parameswaran

Abstract

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous database systems, with inconsistent references and information buried in unstructured text. Existing benchmarks only tackle individual pieces of this problem -- e.g., translating natural-language questions into SQL queries, answering questions over small tables provided in context -- but do not evaluate the full pipeline of integrating, transforming, and analyzing data across multiple database systems. To fill this gap, we present the Data Agent Benchmark (DAB), grounded in a formative study of enterprise data agent workloads across six industries. DAB comprises 54 queries across 12 datasets, 9 domains, and 4 database management systems. On DAB, the best frontier model (Gemini-3-Pro) achieves only 38% pass@1 accuracy. We benchmark five frontier LLMs, analyze their failure modes, and distill takeaways for future data agent development. Our benchmark and experiment code are published at github.com/ucbepic/DataAgentBench.

Paper Structure (37 sections, 1 equation, 7 figures, 8 tables)

Figures (7)

  • Figure 1: (a) In DAB, an agent solves a user task by interacting with database querying and Python execution tools within a ReAct-style loop. (b) In this example, the agent operates over unstructured text (i.e., extracting language from the details column in the 3rd tool call) and integrates data across different databases (PostgreSQL and SQLite) by reconciling the ill-formatted join keys (i.e., bref and bid, in the 5th tool call).
  • Figure 2: Dataset creation methodology, illustrated on the bookreview dataset. Step 1: collect an open-source dataset with two tables, books_info and reviews. Step 2: transform the data by removing publishedDate and publisher and re-embedding their values into a new details column via LLM-generated sentences, and by prefixing the join keys (id$\to$bid, book_id$\to$bref). Step 3: distribute the tables across PostgreSQL and SQLite. Step 4: create a dataset description (descriptions.txt) and a hints file (hints.txt).
  • Figure 3: Stylized summary of the prompt structure. The system prompt is shared across all queries; the user prompt is instantiated per query. Database descriptions and hints are dataset-level and remain unchanged across queries within the same dataset (see the section on the benchmark construction methodology). Full templates are in the appendix on prompt templates.
  • Figure 4: Cost (USD, log scale) vs. pass@1 accuracy. GPT-5-mini achieves the best cost-accuracy tradeoff; Gemini-3-Pro leads in accuracy at $20\times$ the cost.
  • Figure 5: Pass@$k$ as a function of $k$ (number of attempts). Agent rankings remain stable across all $k$; even at $k{=}50$, the best agent does not exceed 69%.
  • ...and 2 more figures
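
The pass@$k$ metric in the Figure 5 caption can be illustrated with a short sketch. The code below uses the standard unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ independent attempts of which $c$ succeeded. Whether the paper computes pass@$k$ in exactly this way is an assumption; the function names `pass_at_k` and `benchmark_pass_at_k` and the sample counts are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds,
    given n recorded attempts of which c were correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute the metric differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, so any
    # size-k subset of attempts must then contain a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over queries; each entry is (n attempts, c correct)."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-query outcomes, not taken from the paper.
    toy_results = [(50, 0), (50, 10), (50, 50)]
    for k in (1, 10, 50):
        print(k, round(benchmark_pass_at_k(toy_results, k), 3))
```

A query that is never solved in any of its $n$ attempts contributes 0 at every $k$, which is one reason an aggregate pass@$k$ curve can plateau below 100% even as $k$ grows.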