BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain

Rahul Kumar, Amar Raja Dibbu, Shrutendra Harsola, Vignesh Subrahmaniam, Ashutosh Modi

TL;DR

BookSQL addresses the challenge of querying accounting databases via natural language by introducing a large, domain-focused Text-to-SQL dataset (100k NL-SQL pairs on a 1M-record accounting schema across 27 businesses) and an expert-driven annotation workflow. The authors benchmark multiple baselines, including SEDE, UniSAr, RESDSQL, and GPT-4–based prompting, revealing substantial gaps in domain generalization and stating the need for specialized models to handle time-based filters and nested SQL common in accounting queries. Through error analysis, they show that date handling, nested queries, and domain-specific constraints are major pain points, with RESDSQL offering the strongest baseline performance among existing models. The work provides a valuable resource to foster domain-aware modeling and demonstrates the practical challenges in building robust NL-to-SQL systems for finance and accounting in real-world settings.

Abstract

Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language query-SQL pairs and accounting databases with 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, thus pointing towards developing more focused models for this domain.

Paper Structure (29 sections, 4 figures, 14 tables)

Figures (4)

  • Figure 1: BookSQL Database schema
  • Figure 2: An example showing the pipeline for creating the BookSQL dataset. Here, aggregation_entity can be replaced by max, min, total, or average, and customer_name can be replaced with any possible name to get a Question-SQL pair. Similarly, date/period can be replaced with last quarter, this quarter, or last month.
  • Figure 3: Sample BookSQL Business Distribution. The middle section shows a sample set of businesses, the inner section shows the industries associated with each business, and the outermost section shows the corresponding product of the business. This chart is made with the information available at: https://www.ibisworld.com/united-states/list-of-industries/.
  • Figure 4: BookSQL Business Distribution. Here, the inner circle indicates the industries, the middle circle shows the sets of businesses associated with the respective industry, and the outermost circle indicates the corresponding product of the business. This chart is made with the information available at: https://www.ibisworld.com/united-states/list-of-industries/.
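
The placeholder substitution described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the question and SQL templates, the `invoices` table, and the `amount`, `customer_name`, and `transaction_date` columns are hypothetical names chosen for illustration, and resolving periods to calendar date ranges is an assumption of this sketch.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical templates: the table and column names below are
# illustrative only and are not taken from the BookSQL schema.
QUESTION_TEMPLATE = "What is the {agg_word} invoice amount for {customer} {period}?"
SQL_TEMPLATE = (
    "SELECT {agg_func}(amount) FROM invoices "
    "WHERE customer_name = '{customer}' "
    "AND transaction_date BETWEEN '{start}' AND '{end}'"
)

# Natural-language aggregation words mapped to SQL aggregate functions.
AGGREGATIONS = {"max": "MAX", "min": "MIN", "total": "SUM", "average": "AVG"}


def quarter_bounds(day):
    """Return the first and last day of the calendar quarter containing `day`."""
    first_month = 3 * ((day.month - 1) // 3) + 1
    start = date(day.year, first_month, 1)
    if first_month == 10:
        next_start = date(day.year + 1, 1, 1)
    else:
        next_start = date(day.year, first_month + 3, 1)
    return start, next_start - timedelta(days=1)


def month_bounds_before(day):
    """Return the first and last day of the month preceding `day`'s month."""
    end = day.replace(day=1) - timedelta(days=1)
    return end.replace(day=1), end


def resolve_periods(today):
    """Map each period phrase to a concrete (start, end) date range."""
    this_q = quarter_bounds(today)
    last_q = quarter_bounds(this_q[0] - timedelta(days=1))
    return {
        "last quarter": last_q,
        "this quarter": this_q,
        "last month": month_bounds_before(today),
    }


def generate_pairs(customers, today):
    """Expand the templates over every aggregation, customer, and period."""
    pairs = []
    for (word, func), customer, (phrase, (start, end)) in product(
        AGGREGATIONS.items(), customers, resolve_periods(today).items()
    ):
        question = QUESTION_TEMPLATE.format(
            agg_word=word, customer=customer, period=phrase
        )
        sql = SQL_TEMPLATE.format(
            agg_func=func,
            customer=customer.replace("'", "''"),  # escape quotes in SQL literals
            start=start.isoformat(),
            end=end.isoformat(),
        )
        pairs.append((question, sql))
    return pairs
```

Calling `generate_pairs(["Acme Corp"], date(2023, 5, 17))` yields 12 question-SQL pairs (4 aggregations × 1 customer × 3 periods). Anchoring period phrases to an explicit reference date keeps the generated SQL reproducible.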