Table of Contents
Fetching ...

ChatBI: Towards Natural Language to Complex Business Intelligence SQL

Jinqing Lian, Xinyi Liu, Yingxia Shao, Yang Dong, Ming Wang, Zhang Wei, Tianqi Wan, Ming Dong, Hailin Yan

TL;DR

ChatBI addresses the NL2BI challenge by introducing a production-oriented, multi-module approach that separates interaction handling, schema linking, and SQL generation. It decomposes the problem into Multi-Round Dialogue Matching, Single View Selection, and a Phased Process Flow with virtual columns, leveraging lightweight models and database-view techniques to cope with hundreds of BI columns and ambiguous metrics. The system outputs JSON intermediates that are then translated into SQL via rule-based templates, greatly reducing token load and improving accuracy in complex semantics and comparisons. Deployed on Baidu's data platform, ChatBI demonstrates superior practicality and efficiency compared to mainstream NL2SQL methods, with notable gains in usefulness execution accuracy and token economy in both SRD and MRD settings.

Abstract

The Natural Language to SQL (NL2SQL) technology provides non-expert users who are unfamiliar with databases the opportunity to use SQL for data analysis.Converting Natural Language to Business Intelligence (NL2BI) is a popular practical scenario for NL2SQL in actual production systems. Compared to NL2SQL, NL2BI introduces more challenges. In this paper, we propose ChatBI, a comprehensive and efficient technology for solving the NL2BI task. First, we analyze the interaction mode, an important module where NL2SQL and NL2BI differ in use, and design a smaller and cheaper model to match this interaction mode. In BI scenarios, tables contain a huge number of columns, making it impossible for existing NL2SQL methods that rely on Large Language Models (LLMs) for schema linking to proceed due to token limitations. The higher proportion of ambiguous columns in BI scenarios also makes schema linking difficult. ChatBI combines existing view technology in the database community to first decompose the schema linking problem into a Single View Selection problem and then uses a smaller and cheaper machine learning model to select the single view with a significantly reduced number of columns. The columns of this single view are then passed as the required columns for schema linking into the LLM. Finally, ChatBI proposes a phased process flow different from existing process flows, which allows ChatBI to generate SQL containing complex semantics and comparison relations more accurately. We have deployed ChatBI on Baidu's data platform and integrated it into multiple product lines for large-scale production task evaluation. The obtained results highlight its superiority in practicality, versatility, and efficiency. At the same time, compared with the current mainstream NL2SQL technology under our real BI scenario data tables and queries, it also achieved the best results.

ChatBI: Towards Natural Language to Complex Business Intelligence SQL

TL;DR

ChatBI addresses the NL2BI challenge by introducing a production-oriented, multi-module approach that separates interaction handling, schema linking, and SQL generation. It decomposes the problem into Multi-Round Dialogue Matching, Single View Selection, and a Phased Process Flow with virtual columns, leveraging lightweight models and database-view techniques to cope with hundreds of BI columns and ambiguous metrics. The system outputs JSON intermediates that are then translated into SQL via rule-based templates, greatly reducing token load and improving accuracy in complex semantics and comparisons. Deployed on Baidu's data platform, ChatBI demonstrates superior practicality and efficiency compared to mainstream NL2SQL methods, with notable gains in usefulness execution accuracy and token economy in both SRD and MRD settings.

Abstract

The Natural Language to SQL (NL2SQL) technology provides non-expert users who are unfamiliar with databases the opportunity to use SQL for data analysis.Converting Natural Language to Business Intelligence (NL2BI) is a popular practical scenario for NL2SQL in actual production systems. Compared to NL2SQL, NL2BI introduces more challenges. In this paper, we propose ChatBI, a comprehensive and efficient technology for solving the NL2BI task. First, we analyze the interaction mode, an important module where NL2SQL and NL2BI differ in use, and design a smaller and cheaper model to match this interaction mode. In BI scenarios, tables contain a huge number of columns, making it impossible for existing NL2SQL methods that rely on Large Language Models (LLMs) for schema linking to proceed due to token limitations. The higher proportion of ambiguous columns in BI scenarios also makes schema linking difficult. ChatBI combines existing view technology in the database community to first decompose the schema linking problem into a Single View Selection problem and then uses a smaller and cheaper machine learning model to select the single view with a significantly reduced number of columns. The columns of this single view are then passed as the required columns for schema linking into the LLM. Finally, ChatBI proposes a phased process flow different from existing process flows, which allows ChatBI to generate SQL containing complex semantics and comparison relations more accurately. We have deployed ChatBI on Baidu's data platform and integrated it into multiple product lines for large-scale production task evaluation. The obtained results highlight its superiority in practicality, versatility, and efficiency. At the same time, compared with the current mainstream NL2SQL technology under our real BI scenario data tables and queries, it also achieved the best results.
Paper Structure (16 sections, 4 equations, 7 figures, 6 tables, 1 algorithm)

This paper contains 16 sections, 4 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: The difference between NL2SQL and NL2BI problems. The question consists of multiple rounds of dialogue, Query1: "The short video playback volume for the past seven days", Query2: "What about the week-on-week comparison?", Query3: "What about the playback duration?". The NL2BI problem requires the ability to handle complex semantics, comparisons and calculation relationships, as well as multi-round dialogue. And the data sources for the two problems are also different.
  • Figure 2: An example of metadata in NL2SQL and NL2BI. "dau" and "uv" are Virtual Columns.
  • Figure 3: The Overview of ChatBI.
  • Figure 4: The method used in MRD Matching.
  • Figure 5: The prompt for SFT training data generation using GPT-4.
  • ...and 2 more figures