StatBot.Swiss: Bilingual Open Data Exploration in Natural Language

Farhad Nooralahzadeh, Yi Zhang, Ellery Smith, Sabine Maennel, Cyril Matthey-Doret, Raphaël de Fondville, Kurt Stockinger

TL;DR

StatBot.Swiss introduces the first bilingual Text-to-SQL benchmark, compiling 455 NL/SQL pairs over 35 real Swiss databases in English and German, derived from opendata.swiss. The authors evaluate GPT-3.5-Turbo-16k and Mixtral-8x7B-Instruct using in-context learning with varied exemplar selection, and extend the Spider hardness levels with an additional 'unknown' category to capture real-world query complexity. Results show that current LLMs rarely produce exactly correct queries: mean strict execution accuracy is in the low tens in the zero-shot setting and improves only modestly with few-shot prompts, while the soft and partial metrics indicate partial progress toward the user's intent. The work highlights the need for improved multilingual prompting, larger and more diverse bilingual datasets, and further cross-lingual Text-to-SQL research, and it offers a baseline for bilingual real-world applications.

Abstract

The potential for improvements brought by Large Language Models (LLMs) in Text-to-SQL systems is mostly assessed on monolingual English datasets. However, LLMs' performance for other languages remains largely unexplored. In this work, we release the StatBot.Swiss dataset, the first bilingual benchmark for evaluating Text-to-SQL systems based on real-world applications. The StatBot.Swiss dataset contains 455 natural language/SQL pairs over 35 big databases with varying levels of complexity for both English and German. We evaluate the performance of state-of-the-art LLMs such as GPT-3.5-Turbo and Mixtral-8x7B-Instruct on the Text-to-SQL translation task using an in-context learning approach. Our experimental analysis shows that current LLMs struggle to generalize when generating SQL queries on our novel bilingual dataset.

Paper Structure (24 sections, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Dataset distribution: (a) Left: Knowledge domains, (b) Right: Distribution of natural language/SQL-pairs over the train and development sets. EN = English, DE = German. The numbers on top of the bars denote the number of Text-to-SQL pairs.
  • Figure 2: Mean strict execution accuracy: zero-shot and few-shot for GPT-3.5 ($m = 5$) and Mixtral ($m = 6$) models using similarity-based selection, where the number of examples is chosen to maximize $\text{EA}_\text{strict}$.
  • Figure 3: (Left) Strict execution accuracy ($\text{EA}_\text{strict}$) for each language. (Right) $\text{EA}_\text{strict}$ for each language per query hardness level. All metrics are computed on the development set for zero-shot and few-shot prompting strategies (6-shot in Mixtral, 5-shot in GPT-3.5).
  • Figure 4: Entity-relationship diagram of the knowledge domain criminal offences. spatial_unit is the dimension table and criminal_offenses_registered_by_police is the fact table. EN = English, NN stands for NOT NULL constraint. Note that the dimension table spatial_unit contains information about different levels of granularity and thus enables aggregating facts by, e.g. municipality, canton and country. However, note that not all facts contain information about all levels of granularity. For instance, some facts are only collected at municipality level while others are collected at cantonal level.
  • Figure 5: Entity-relationship diagram of the knowledge domain medizinisch_technische_infrastruktur [DE] (in Eng. medical technical infrastructure), where NN stands for NOT NULL constraint. Note that the dimension table raeumliche_einheit contains information about different levels of granularity and thus enables aggregating facts by, e.g. Gemeinde (in English: municipality), Kanton (in English: canton) and Land (in English: country). However, note that not all facts contain information about all levels of granularity. For instance, some facts are only collected at municipality level while others are collected at cantonal level.
  • ...and 3 more figures
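
The fact/dimension layout described in the captions of Figures 4 and 5 can be illustrated with a small sketch. It is not the dataset's actual schema: only the table names spatial_unit and criminal_offenses_registered_by_police come from the captions, while the column names (`id`, `name`, `level`, `spatial_unit_id`, `year`, `offence_count`), the helper functions, and all row values are hypothetical toy choices. The sketch joins the fact table to the dimension table and aggregates at one granularity level; a level with no recorded facts simply returns no rows.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory toy database (hypothetical columns and values)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension table: one row per spatial unit, tagged with its level.
        CREATE TABLE spatial_unit (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            level TEXT NOT NULL  -- 'municipality', 'canton' or 'country'
        );
        -- Fact table: one measurement per spatial unit and year.
        CREATE TABLE criminal_offenses_registered_by_police (
            spatial_unit_id INTEGER NOT NULL REFERENCES spatial_unit(id),
            year            INTEGER NOT NULL,
            offence_count   INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO spatial_unit VALUES (?, ?, ?)",
        [(1, "Zurich", "canton"), (2, "Bern", "canton"),
         (3, "Winterthur", "municipality")],
    )
    # Made-up counts, for illustration only.
    conn.executemany(
        "INSERT INTO criminal_offenses_registered_by_police VALUES (?, ?, ?)",
        [(1, 2020, 100), (1, 2021, 120), (2, 2020, 80), (3, 2020, 30)],
    )
    return conn


def totals_by_level(conn: sqlite3.Connection, level: str) -> list[tuple[str, int]]:
    """Sum the facts per spatial unit, restricted to one granularity level."""
    query = """
        SELECT s.name, SUM(f.offence_count)
        FROM criminal_offenses_registered_by_police AS f
        JOIN spatial_unit AS s ON f.spatial_unit_id = s.id
        WHERE s.level = ?
        GROUP BY s.name
        ORDER BY s.name
    """
    return conn.execute(query, (level,)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for lvl in ("municipality", "canton", "country"):
        print(lvl, totals_by_level(db, lvl))
```

In this toy data no facts are recorded at country level, so that query returns an empty list, mirroring the caption's remark that not all facts exist at every level of granularity.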