Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed, May Alsofyani, Saad Almohaimeed, Mansour Al Ghanim, Liqiang Wang

TL;DR

This work addresses the lack of Arabic context-dependent text-to-SQL resources by introducing Ar-SParC, a cross-domain dataset with 3,450 question sequences (10,225 questions in total) across 160 databases and 116 domains. It reports 40 experiments using GPT-3.5-Turbo and GPT-4.5-Turbo with a range of prompt engineering techniques, covering question representations, in-context learning (ICL), and the novel GAT corrector, which detects and corrects SQL errors in one step. The GAT corrector yields consistent, modest average improvements in execution accuracy ($EX$) and interaction accuracy ($IX$): 1.9% $EX$ and 1.9% $IX$ in zero-shot settings, and 1.72% $EX$ and 0.92% $IX$ in ICL settings. An ablation study examines why the GAT corrector outperforms the earlier GAT verifier, particularly for Arabic. Overall, the paper advances Arabic semantic parsing and highlights the importance of language-specific prompt engineering for improving cross-domain text-to-SQL systems in low-resource languages.

Abstract

In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. It enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10,225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.
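
The abstract reports results in EX and IX without defining them. As a rough illustration, the sketch below assumes the definitions commonly used for context-dependent text-to-SQL benchmarks: EX is the share of individual questions whose predicted SQL executes to the gold result, and IX is the share of question sequences in which every question is correct. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List

# Each inner list is one question sequence (interaction); each bool says
# whether the predicted SQL for that question matched the gold result.
Interactions = List[List[bool]]


def execution_accuracy(interactions: Interactions) -> float:
    """Assumed EX: fraction of all questions answered correctly."""
    flags = [ok for turns in interactions for ok in turns]
    return sum(flags) / len(flags)


def interaction_accuracy(interactions: Interactions) -> float:
    """Assumed IX: fraction of sequences with every question correct."""
    return sum(all(turns) for turns in interactions) / len(interactions)


# Toy data (made up): 4 sequences, 12 questions in total.
toy = [
    [True, True, True],
    [True, False, True],
    [True, True],
    [False, True, True, True],
]

ex = execution_accuracy(toy)    # 10 of 12 questions correct
ix = interaction_accuracy(toy)  # 2 of 4 sequences fully correct
```

Under these assumed definitions, a single wrong question lowers EX only slightly but zeroes out its whole sequence for IX, which is one plausible reason the reported IX gains can differ from the EX gains.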

Paper Structure

This paper contains 14 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Examples of questions paired with their corresponding SQL queries in Ar-SParC. The first question is at the top, while the conversation progresses to the final question at the bottom.
  • Figure 2: The process on the left shows how the GAT verifier works. First, the user's prompt is input into the LLM, which generates an SQL query. This query, along with the user question and schema, is then input into the GAT verifier to detect errors. The verifier outputs the result, which is then fed back to the LLM to correct any mistakes. The process on the right illustrates how the GAT corrector works. It is similar to the GAT verifier, but instead of only detecting errors, the GAT corrector detects and corrects them at the same time.
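
The two flows described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the LLM, verifier, and corrector can each be modeled as a function from prompt text to output text. The prompt wording, the function names, and the stub models are all hypothetical; the paper's actual prompts and implementation are not shown here.

```python
from typing import Callable, List

# Hypothetical interface: any component takes prompt text, returns text.
TextModel = Callable[[str], str]


def base_prompt(question: str, schema: str) -> str:
    """Hypothetical prompt layout: schema followed by the user question."""
    return f"Schema:\n{schema}\nQuestion: {question}\n"


def run_with_verifier(llm: TextModel, verifier: TextModel,
                      question: str, schema: str) -> str:
    """Left side of Figure 2: generate, detect errors, then ask the LLM to fix."""
    prompt = base_prompt(question, schema)
    draft = llm(prompt + "SQL:")
    report = verifier(prompt + f"SQL: {draft}\nList any errors:")
    return llm(prompt + f"Draft SQL: {draft}\nErrors: {report}\nCorrected SQL:")


def run_with_corrector(llm: TextModel, corrector: TextModel,
                       question: str, schema: str) -> str:
    """Right side of Figure 2: generate, then detect and correct in one step."""
    prompt = base_prompt(question, schema)
    draft = llm(prompt + "SQL:")
    return corrector(prompt + f"SQL: {draft}\nReturn the corrected SQL:")


# Stub components (stand-ins for real models) that record the call order.
calls: List[str] = []


def stub_llm(prompt: str) -> str:
    calls.append("llm")
    return "SELECT name FROM singer"


def stub_verifier(prompt: str) -> str:
    calls.append("verifier")
    return "no errors found"


def stub_corrector(prompt: str) -> str:
    calls.append("corrector")
    return "SELECT name FROM singer"
```

With the stubs, `run_with_verifier(stub_llm, stub_verifier, "List singer names", "singer(id, name)")` makes three calls (LLM, verifier, LLM), while `run_with_corrector` makes two (LLM, corrector). That difference in round trips is the structural contrast the caption describes.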