Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo

TL;DR

CORGI addresses the gap between standard text-to-SQL benchmarks and real-world business intelligence by introducing a business-domain benchmark with synthetic databases across multiple verticals and four query types (descriptive, explanatory, predictive, recommendational). It pairs this with a holistic, consulting-inspired evaluation framework and two evaluation mechanisms, an atomized multi-agent system and the CORGI Online platform for human judgments, to assess structure, reasoning, and implementability beyond SQL correctness. In experiments with Gemini and GPT-4o, LLM performance declines notably on high-level questions, the models struggle to produce actionable, implementable recommendations, and execution success is significantly lower than on text-to-SQL benchmarks like BIRD. The work provides public data, tools, and a submission platform to accelerate research on LLMs for business decision support, and it suggests directions for improving high-level reasoning, forecasting, and prescriptive capabilities in enterprise contexts.

Abstract

In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions: models struggle to make accurate predictions and to offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between the capabilities of popular LLMs and the needs of real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.
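The abstract compares CORGI and BIRD by execution success rate but does not define the metric here. As a minimal sketch, assuming it means the fraction of generated SQL queries that run against the database without raising an error, it could be computed as follows. The function name `execution_success_rate`, the `orders` table, and the candidate queries are all hypothetical and are not taken from the CORGI release.

```python
import sqlite3


def execution_success_rate(conn, queries):
    """Return the fraction of queries that execute without a database error.

    Assumed definition: a query counts as a success if it runs to completion;
    the correctness of the returned rows is not checked.
    """
    if not queries:
        return 0.0
    succeeded = 0
    for sql in queries:
        try:
            conn.execute(sql).fetchall()
            succeeded += 1
        except sqlite3.Error:
            # Syntax errors, unknown columns, and similar failures land here.
            pass
    return succeeded / len(queries)


# Hypothetical toy database standing in for a business schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (city, total) VALUES (?, ?)",
    [("Austin", 20.5), ("Boston", 35.0), ("Austin", 12.0)],
)

# Hypothetical model outputs: two valid queries and two that fail.
candidate_queries = [
    "SELECT city, SUM(total) FROM orders GROUP BY city",
    "SELECT AVG(total) FROM orders",
    "SELECT revenue FROM orders",  # fails: no such column
    "SELEC * FROM orders",         # fails: syntax error
]

rate = execution_success_rate(conn, candidate_queries)
print(f"execution success rate: {rate:.2f}")
```

A metric of this kind only measures whether a query runs, which is one reason the benchmark pairs it with the separate evaluation of structure, reasoning, and implementability described in the TL;DR.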

Paper Structure

This paper contains 62 sections, 4 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: We incorporate business logic into our synthesized database population process, and propose a multi-agent evaluation framework consisting of a discriminator agent and seven scoring agents, each responsible for evaluating one aspect of the generated answer.
  • Figure 2: Illustrations of data simulation rules without compounding effects, using the Persona Nutrition database as a case study.
  • Figure 3: Business questions range in complexity from simple fact retrieval to speculative business strategy.
  • Figure 4: Comparison between our proposed atomized multi-agent evaluation mechanism and single LLM evaluation. S: Structure, D: Data Sense, I: Insightfulness, O: Operation Implementability, P: Purpose Alignment, C: Compliance. The dashed lines indicate overall average scores.
  • Figure 5: Comparison of LLM SQL query execution performance across question types.
  • ...and 1 more figure