Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain
Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo
TL;DR
CORGI addresses the gap between standard text-to-SQL benchmarks and real-world business intelligence by introducing a business-domain benchmark with synthetic databases across multiple verticals and four query types (descriptive, explanatory, predictive, recommendational). It pairs this with a holistic, consulting-inspired evaluation framework and two evaluation mechanisms, including an atomized multi-agent system and the CORGI Online platform for human judgments, to assess structure, reasoning, and implementability beyond SQL correctness. In experiments with Gemini and GPT-4o, LLMs show notable declines on high-level questions, struggle to produce actionable, implementable recommendations, and achieve a significantly lower execution success rate on CORGI than on existing text-to-SQL benchmarks such as BIRD. The work provides public data, tools, and a submission platform to accelerate research on LLMs for business decision support, and it suggests directions for improving high-level reasoning, forecasting, and prescriptive capabilities in enterprise contexts.
Abstract
In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. Answering these questions calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, where models struggle to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between the capabilities of popular LLMs and the needs of real-world business intelligence. We release a public dataset, an evaluation framework, and a website for public submissions.
