Ar-Spider: Text-to-SQL in Arabic

Saleh Almohaimeed, Saad Almohaimeed, Mansour Al Ghanim, Liqiang Wang

TL;DR

Ar-Spider introduces the first Arabic cross-domain text-to-SQL dataset and addresses the schema linguistic and SQL structural challenges inherent to Arabic. By evaluating LGESQL and S2SQL with cross-lingual encoders and proposing Context Similarity Relationships (CSR) based on LASER embeddings, the study significantly improves Arabic parsing performance and reduces the English-Arabic gap to 7.73%. CSR consistently improves the baseline models, with a top exact-match accuracy of 66.63% on Ar-Spider for LGESQL + XLM-R + CSR. The work demonstrates the potential of cross-lingual bridging techniques for non-English semantic parsing and shows that naively combining English and Arabic data does not always yield gains, which points future research toward cross-language schema linking and morphology-aware adaptation.

Abstract

Text-to-SQL semantic parsing is one of the most important tasks in Natural Language Processing (NLP): it lets users interact with databases in a more natural manner. Text-to-SQL has made significant progress in recent years, but most of that work has been English-centric. In this paper, we introduce Ar-Spider, the first Arabic cross-domain text-to-SQL dataset. The unique nature of the language raises two major challenges, namely schema linguistic and SQL structural challenges. To handle these issues and conduct the experiments, we adopt two baseline models, LGESQL [4] and S2SQL [12], each tested with two cross-lingual models to alleviate the effects of the schema linguistic and SQL structural challenges. The baselines demonstrate decent single-language performance on Ar-Spider, achieving 62.48% for S2SQL and 65.57% for LGESQL, only 8.79% below the highest results the baselines achieve when trained on the English dataset. To improve performance on Arabic text-to-SQL, we propose the context similarity relationship (CSR) approach, which raises overall performance by about 1.52% for S2SQL and 1.06% for LGESQL and narrows the gap between Arabic and English to 7.73%.
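
The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above. The English best score is not stated in this section: it is inferred here as the best Arabic baseline plus the reported 8.79 gap, on the assumption that all gaps and gains are percentage points.

```python
# Exact-match accuracy figures reported in the abstract (percent).
baseline = {"S2SQL": 62.48, "LGESQL": 65.57}
csr_gain = {"S2SQL": 1.52, "LGESQL": 1.06}
gap_before_csr = 8.79  # best Arabic baseline vs. best English result

# Accuracy after adding CSR, per model.
with_csr = {model: round(baseline[model] + csr_gain[model], 2) for model in baseline}
best_arabic = max(with_csr.values())

# ASSUMPTION: inferred from the reported gap, not stated in the abstract.
english_best = round(max(baseline.values()) + gap_before_csr, 2)
gap_after_csr = round(english_best - best_arabic, 2)

print(with_csr)       # per-model accuracy with CSR
print(gap_after_csr)  # remaining Arabic-English gap
```

Under that assumption the numbers are mutually consistent: LGESQL with CSR reaches the 66.63% quoted in the TL;DR, and the remaining gap matches the reported 7.73%.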

Paper Structure (20 sections, 1 figure, 7 tables)


Figures (1)

  • Figure 1: An illustration of the overall model architecture. There are three types of relations between nodes in the graph. Dotted lines represent question-table-cosine-matches or question-column-cosine-matches. Bold lines represent question structure relationships between the question-token nodes. Solid lines indicate schema structure relationships, such as primary and foreign keys. The top-right subgraph shows how question-schema relations were created before CSR.
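
The cosine-match relations named in the Figure 1 caption can be pictured with a minimal sketch. This is not the authors' implementation. The toy vectors stand in for LASER sentence embeddings, and the `threshold` value, the function names, and the edge-label strings are hypothetical choices made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cosine_match_edges(question_vecs, schema_vecs, threshold=0.8):
    """Link a question token to a schema item when their embeddings are similar enough.

    ASSUMPTION: the threshold and the label format are illustrative, not from the paper.
    """
    edges = []
    for q_name, q_vec in question_vecs.items():
        for s_name, (kind, s_vec) in schema_vecs.items():
            if cosine(q_vec, s_vec) >= threshold:
                edges.append((q_name, s_name, f"question-{kind}-cosine-match"))
    return edges

# Toy 3-dimensional vectors standing in for cross-lingual embeddings.
question_vecs = {"singers": [0.9, 0.1, 0.0], "age": [0.0, 0.2, 0.9]}
schema_vecs = {
    "singer": ("table", [1.0, 0.0, 0.0]),
    "singer.age": ("column", [0.1, 0.1, 1.0]),
}

edges = cosine_match_edges(question_vecs, schema_vecs)
for edge in edges:
    print(edge)
```

Matching in embedding space rather than by exact string overlap is what allows an Arabic question token to link to an English schema name, provided the encoder maps both languages into a shared space.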