BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain
Rahul Kumar, Amar Raja Dibbu, Shrutendra Harsola, Vignesh Subrahmaniam, Ashutosh Modi
TL;DR
BookSQL addresses the challenge of querying accounting databases in natural language by introducing a large, domain-focused Text-to-SQL dataset (100k NL-SQL pairs over accounting databases of 1M records across 27 businesses) and an expert-driven annotation workflow. The authors benchmark multiple baselines, including SEDE, UniSAr, RESDSQL, and GPT-4–based prompting, revealing substantial gaps in domain generalization and indicating the need for specialized models that handle the time-based filters and nested SQL common in accounting queries. Their error analysis shows that date handling, nested queries, and domain-specific constraints are the major pain points, with RESDSQL the strongest of the existing baselines. The dataset is intended as a resource for domain-aware modeling and illustrates the practical challenges of building robust NL-to-SQL systems for finance and accounting in real-world settings.
Abstract
Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is a pressing need to develop models that can help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language query-SQL pairs and accounting databases with 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, which points to the need for more focused models for this domain.
