ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz

Abstract

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
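
For context on the agreement statistic quoted above, the snippet below is a purely illustrative sketch, not taken from the paper: it implements the standard Fleiss' kappa formula, and the three-annotator, three-label rating matrix is an invented toy example rather than the study's actual annotation data.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of annotators who assigned item i to category j."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]  # assumes every item is rated by the same number of annotators
    p_j = counts.sum(axis=0) / (n_items * n_raters)            # per-category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return float((P_bar - P_e) / (1 - P_e))

# Toy example: 5 audited tasks, 3 annotators, 3 labels
# (agent-attributable, benchmark-attributable, mixed) -- all counts invented.
ratings = np.array([
    [3, 0, 0],
    [0, 3, 0],
    [0, 2, 1],
    [0, 3, 0],
    [1, 2, 0],
])
print(round(fleiss_kappa(ratings), 2))  # prints 0.44 on this toy matrix
```

On this toy matrix the script prints 0.44; the 0.85 reported in the abstract would indicate substantially stronger agreement among the paper's annotators.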

Paper Structure

This paper contains 61 sections, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Distribution of error attribution across 81 failed transformation tasks in ELT-Bench. Tasks are classified by error source---agent-attributable, benchmark-attributable, or mixed---and further stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground truth column removal.
  • Figure 2: Overview of a typical ELT pipeline. Data is first extracted from heterogeneous sources---such as databases, APIs, flat files---using connectors like Airbyte [airbyte]. It is then loaded in raw form into a cloud data warehouse such as Snowflake [snowflake]. Finally, transformation logic, typically written as SQL models in dbt [dbt], reshapes the raw data into analytics-ready tables within the warehouse.
  • Figure 3: Agent workflow in ELT-Bench, illustrating the two-stage pipeline the agent must construct. EL Stage: The agent reads connection details from config.yaml, writes Terraform configurations to define Airbyte [airbyte] sources, the Snowflake [snowflake] destination, and their connectors, deploys the configuration, and triggers synchronization jobs that extract data from heterogeneous sources (databases, APIs, flat files) and load it into the warehouse as staging tables. The agent then uses a provided Python script in the workspace to poll job status and proceeds to the T stage once all jobs succeed (i.e., all source tables for the task have been loaded into Snowflake). T Stage: The agent initializes a dbt [dbt] project, authors SQL transformation models that reshape the staged data into the target data models defined in data_model_schema.yaml, and executes dbt run, which generates the target data model(s) in the Snowflake data warehouse.
  • Figure 4: Pipeline outcomes across extraction/loading and transformation stages.
  • Figure 5: Overview of the Auditor-Corrector framework. The Auditor proceeds in three phases: Phase 1 constructs a structured analysis environment for each failed task from the evaluation log; Phase 2 deploys an LLM agent per task that autonomously reverse-engineers the correct SQL for each unmatched column, verifying each derivation against the ground truth; Phase 3 involves manual categorization of all 660 reports into an error taxonomy. The Corrector then applies category-specific corrections---evaluation script refinements and removal of columns with unreliable ground truth---to produce ELT-Bench-Verified. (An illustrative sketch of one such evaluation refinement follows this list.)
  • ...and 1 more figure
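
To make the Corrector's "evaluation script refinements" concrete, here is a minimal, hypothetical sketch -- it is not the benchmark's actual evaluator. It swaps brittle exact-match column comparison for an order-insensitive check that tolerates whitespace and small floating-point differences; the function name columns_match and the float_tol parameter are illustrative assumptions.

```python
import pandas as pd

def columns_match(agent: pd.Series, truth: pd.Series, float_tol: float = 1e-6) -> bool:
    """Order-insensitive comparison of an agent-produced column against
    ground truth, tolerating whitespace and tiny floating-point noise."""
    a, t = agent.dropna(), truth.dropna()
    # Require the same number of non-null and null entries on both sides.
    if len(a) != len(t) or agent.isna().sum() != truth.isna().sum():
        return False
    # Numeric columns: sort both sides and compare within a tolerance.
    if pd.api.types.is_numeric_dtype(a) and pd.api.types.is_numeric_dtype(t):
        a_sorted = a.sort_values().to_numpy(dtype=float)
        t_sorted = t.sort_values().to_numpy(dtype=float)
        return bool((abs(a_sorted - t_sorted) <= float_tol).all())
    # Everything else: normalize to stripped strings and compare as multisets.
    normalize = lambda s: sorted(s.astype(str).str.strip())
    return normalize(a) == normalize(t)

# Example: same values in a different row order, with a float rounding artifact.
agent_col = pd.Series([3.0000001, 1.0, 2.0])
truth_col = pd.Series([1.0, 2.0, 3.0])
print(columns_match(agent_col, truth_col))  # True under the relaxed comparison
```

A relaxed comparison along these lines avoids penalizing agent outputs that are semantically correct but differ in row order or formatting, which is the failure mode the audit attributes to rigid evaluation scripts.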