AIDABench: AI Data Analytics Benchmark

Yibo Yang; Fei Lei; Yixuan Sun; Yantao Zeng; Chengguang Lv; Jiancao Hong; Jiaojiao Tian; Tianyu Qiu; Xin Wang; Yanbing Chen; Yanjie Li; Zheng Pan; Xiaochen Zhou; Guanzhou Chen; Haoran Lv; Yuning Xu; Yue Ou; Haodong Liu; Shiqi He; Anya Jia; Yulei Xin; Huan Wu; Liang Liu; Jiaye Ge; Jianxin Dong; Dahua Lin; Wenxiu Sun

Abstract

As AI-driven document understanding and processing tools become more prevalent in real-world applications, the need for rigorous evaluation standards has grown urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios and fail to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks are challenging enough that even human experts assisted by AI tools require 1-2 hours per question, underscoring the benchmark's difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results show that complex, real-world data analytics tasks remain a significant challenge for current AI systems: the best-performing model achieves only 59.43% pass@1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.

Paper Structure (40 sections, 2 equations, 10 figures, 4 tables)

Introduction
Related Works
Spreadsheet and Tabular Manipulation
Document Understanding and PDF Analytics
Agentic Data Intelligence and Professional Workflows
Methods
Dataset Construction
Task Design and Quality Control
Dataset Statistics
Inference Protocol
Evaluator Design
QA Evaluator
Visualization Evaluator
Spreadsheet File Evaluator
Evaluator Selection and Calibration
...and 25 more sections

Figures (10)

Figure 1: Overview of the AIDABench evaluation framework. The workflow illustrates the pipeline from multi-format Raw Data ingestion, through Intermediate Processing (encompassing the capability dimensions of data editing/transformation and numerical/statistical reasoning in a non-sequential flow), to the final Delivery Result (corresponding to the QA, data visualization, and file generation dimensions).
Figure 2: Three example evaluation scenarios in the benchmark. (a) QA: answer users’ data analysis questions based on the provided data; (b) Data Visualization: create visualizations based on users’ questions and the provided data; (c) File Generation: generate spreadsheets according to users’ requirements. Each scenario shows the corresponding Query and the expected Reference output format.
Figure 3: The design of three types of evaluators in AIDABench.
Figure 4: Error composition within bad cases across scenarios. Heatmaps report, for each model and scenario (QA, Data Visualization, and File Generation), the percentage share of each error type among the bad cases. Redder colors indicate higher shares. Model names are shown on the left panel, and a single color scale is shared across all panels.
Figure 5: Example of auxiliary spreadsheet summarization input: screenshot of the workbook Revenue_Calculation_Details.xlsx (active sheet excerpt).
...and 5 more figures
