OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen

TL;DR

Providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents, but significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.

Abstract

We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
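
To make the phrase "average relative performance gain" concrete, the following is a minimal sketch of one way such a figure could be computed. It assumes the average is the mean of per-agent relative gains, which the abstract does not spell out. The function names and the accuracy values are hypothetical and are not results from the paper.

```python
from typing import Iterable, Tuple


def relative_gain(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, expressed in percent."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (improved - baseline) / baseline * 100.0


def average_relative_gain(pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean of per-agent relative gains (assumed aggregation, see lead-in)."""
    gains = [relative_gain(base, new) for base, new in pairs]
    if not gains:
        raise ValueError("at least one (baseline, improved) pair is required")
    return sum(gains) / len(gains)


# Hypothetical (baseline, improved) accuracies in percent for two agents.
# These are illustrative values only, NOT numbers reported in the paper.
hypothetical_agents = [(30.0, 36.0), (40.0, 44.0)]

if __name__ == "__main__":
    avg = average_relative_gain(hypothetical_agents)
    print(f"average relative gain: {avg:.1f}%")
```

A relative gain is measured against each agent's own baseline, so it differs from the absolute difference in percentage points: in the hypothetical data above, a 6-point improvement on a 30% baseline is a 20% relative gain.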

Paper Structure (39 sections, 1 equation, 17 figures, 8 tables)

This paper contains 39 sections, 1 equation, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Agent performance on OfficeQA Pro. Each agent framework—Google Agent (Gemini CLI), OpenAI Agent (Codex CLI), and Anthropic Agent (Claude Agent SDK)—is evaluated with the provider's latest flagship model, along with the flagship model available at the time OfficeQA was created. Agent performance with latest frontier models is 34.1% on average and none of the agents surpass 50% accuracy.
  • Figure 2: Representative pages from the OfficeQA Pro corpus, illustrating how Treasury Bulletins vary in formatting and layout across decades. Left: March 1941 customs duties and taxable imports by tariff schedule with CY1939–1940 aggregates and monthly values (thousands of dollars). Right: March 1980 Treasury Survey of Ownership of interest-bearing marketable public debt securities by type, maturity, and issue across investor classes (millions of dollars).
  • Figure 3: Sample questions from OfficeQA Pro. Question UID0013 (left) requires information from one bulletin and a linear regression analysis. Question UID0005 (right) requires information from two separate bulletins, multi-step math, and a web search to retrieve the correct CPI value.
  • Figure 4: Overview of the benchmark creation process. In Phase 1, the question is created and initially verified by a new annotator. In Phase 2, two passes of end-to-end quality assurance are completed using AI agents to produce alternative answers, which are used to inform human verification of both the question and the ground truth answer.
  • Figure 5: Performance of frontier LLMs on OfficeQA Pro across three configurations: Prompt Only, Web Search Enabled, and Oracle Page(s) Provided + Web Search. Results are reported as correctness (%) across allowable absolute relative error thresholds of 5.0%, 1.0%, 0.1%, and 0.0% for the three models. In the Oracle configuration, "PDF" denotes the original documents, while "Databricks Parsed" refers to pages parsed with Databricks' ai_parse_document (https://www.databricks.com/blog/pdfs-production-announcing-state-art-document-intelligence-databricks). Using the Databricks-parsed pages provides a relative gain of +50.2% on average at 0.0% allowable absolute relative error.
  • ...and 12 more figures
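
The Figure 5 caption reports correctness at several allowable absolute relative error thresholds. The sketch below shows one plausible way to score numeric answers under such thresholds. It assumes the error is |predicted - truth| / |truth| and that a zero ground truth requires an exact match. The function names and sample values are hypothetical and do not come from the benchmark's actual scoring code.

```python
from typing import Dict, Sequence


def is_correct(predicted: float, truth: float, tolerance: float) -> bool:
    """Check a prediction against an allowable absolute relative error.

    `tolerance` is a fraction (0.01 means 1.0%). A zero ground truth has no
    defined relative error, so this sketch falls back to exact equality.
    """
    if truth == 0:
        return predicted == truth
    return abs(predicted - truth) / abs(truth) <= tolerance


def correctness_by_threshold(
    predictions: Sequence[float],
    truths: Sequence[float],
    tolerances: Sequence[float] = (0.05, 0.01, 0.001, 0.0),
) -> Dict[float, float]:
    """Percentage of correct answers at each tolerance level."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("predictions and truths must be non-empty and equal in length")
    total = len(truths)
    return {
        tol: 100.0 * sum(is_correct(p, t, tol) for p, t in zip(predictions, truths)) / total
        for tol in tolerances
    }


# Hypothetical answers for illustration only, NOT benchmark data.
sample_predictions = [100.0, 103.0, 100.5, 100.05]
sample_truths = [100.0, 100.0, 100.0, 100.0]

if __name__ == "__main__":
    for tol, pct in correctness_by_threshold(sample_predictions, sample_truths).items():
        print(f"allowable error {tol:.1%}: {pct:.1f}% correct")
```

Under this scheme the 0.0% threshold demands an exact numeric match, so correctness can only stay flat or fall as the tolerance tightens.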